Numbers serve to discipline rhetoric. Without them
it is too easy to follow flights of fancy, to ignore the
world as it is and to remold it nearer the heart's
desire
Ralph Waldo Emerson
Imagine:
• If every number came with its units, uncertainty, and exact
meaning all attached in a “metanumber”:
• a universal standard for all numerical data,
• Instantly readable by both humans and computers,
• And there was a simple unique name for every number,
• Which name could be used as a variable name in expressions,
• With all dimensional & error analysis automatically performed
• And all computations tracked allowing reuse of results
• Supporting AI, Big Data, and automated data exchange.
And Also Imagine :
• An algorithm continuously scraping all numerical data from
the web and converting it to standardized MetaNumber tables
• And with a novel process each table is converted to a network
• Among entities (rows) and among properties (columns)
• From which the dominant clusters are extracted and ordered
• Using a novel agnostic cluster identification algorithm
• Then with still another algorithm the dominant clusters are
linked into a single supernet spanning all numeric clusters.
• A single network spanning our entire numerical universe
MetaNumber
A Standardization of
Numerical Information
Joseph E. Johnson, PhD
jjohnson@sc.edu
September 24, 2015
The Problem
• Numerical Data is incomplete without
• Value (the number itself)
• Units (its dimensional basis)
• Accuracy (its numerical uncertainty)
• Defining Metadata (its exact meaning)
• But Published Data Cannot be Read by Computers Because:
• Units, Uncertainty, and Defining Metadata Indices are
• Non-standard, Scattered, Buried in Text, & in Multiple Locations
• Consequently:
• Data must be preprocessed by humans prior to every transfer
• A costly, time consuming, error prone process
• Data exchange is tedious and AI is thwarted esp. for Big Data
Proposed Solution
• Design a standard for all numerical data that tightly links
• Value - Units – Accuracy – Meaning (metadata)
• We call this structure a MetaNumber (MN)
• Design Algorithms to automatically evaluate MN expressions
• Numerical results computed &
• Full dimensional analysis of units with error trapping &
• Automatic computation of numerical accuracy (uncertainty) &
• Tracking of all associated metadata
• MNs must be instantly readable by humans and computers
Other Requirements
• Use the SI metric system as the foundation
• Allow unlimited attached metadata without encumbrance
• Be sufficiently flexible to encompass existing standards
• Support computational collaboration
• Be compatible with current standards in computer coding
• Track the complete computational evolution of each number
• Allow extensions to the SI base units & user defined units
• Design as a function that can be included in other systems.
• Results must be expressible in any desired units.
Numerical Accuracy:
• If there is no decimal then the value is treated as exact
• Change 5.43e4 to 543e2 to make the value exact.
• If there is a decimal then it is converted to a “ufloat”
• 5.43e4 becomes 5.43+/-0.01e4 automatically
• Accuracy is based upon the number of significant digits
• Unit conversion values and internal constants are kept as exact
• Keyed MNs and table values are treated as stated
• > 4+5 >> 9
• > 4.0 + 5 >> 9.00+/-0.10
• > 345e-2 * 88 >> 303.6
• > 3.45*88 >> 303.6+/-0.9
• If the uncertainty value is not needed, ignore it, use the value
• But the value as listed represents the result.
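The decimal-point rule above can be sketched in a few lines. This is only an illustration of the stated rule, not the MetaNumber implementation; the function name `implied_uncertainty` is hypothetical, and it assumes the uncertainty is one unit in the last written digit, as in the 5.43e4 example.

```python
from decimal import Decimal


def implied_uncertainty(literal: str) -> tuple[float, float]:
    """Return (value, uncertainty) for a numeric literal.

    No decimal point -> treated as exact (uncertainty 0).
    Decimal point    -> one unit in the last written digit (assumed rule).
    """
    number = Decimal(literal)
    if "." not in literal:
        return float(number), 0.0
    # The exponent of the stored digits gives the place value of the last
    # digit, e.g. Decimal("5.43e4") has digits 543 and exponent 2 -> 1e2.
    last_place = Decimal(1).scaleb(number.as_tuple().exponent)
    return float(number), float(last_place)


print(implied_uncertainty("5.43e4"))  # decimal point present: uncertain
print(implied_uncertainty("543e2"))   # no decimal point: exact
```

Working from the literal string rather than a float matters here: once parsed to a float, 5.43e4 and 543e2 are indistinguishable.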
Units - 1
• Are variable names in mathematical expressions:
• > 3*kg + 2.4*lb
>> 4.09+/-0.05*kg
• Are single names: ouncetroy not troy ounce.
• > 4*ouncetroy
>> 0.1244139072*kg
• Are lower case with only a…z, 0-9 characters, “_” (interior):
• mile, mach, c, m3, s_2, gallon
• No information in fonts, or superscripts or subscripts
• Not m² but m2; not s⁻² but s_2
Units – 2
• No Greek, other alphabetic or special symbols:
• ohm not Ω, hb not ħ, pi not π.
• Only lower case for all characters:
• kelvin or k not Kelvin,
ampere or a not Ampere
• Use singular spelling and not plural:
• foot not feet,
mile not miles,
gallon not gallons
• Are all defined in terms of the basic SI units:
• meter, kilogram, second, ampere, kelvin, candela
• Or abbreviated as m, kg, s, a, k, cd
Units - 3
• All prefixes are separate words, compounded any way:
• >micro*thousand*nano*dozen*ft
>> 3.6576e-12*m
• Allows metric, English, and common dimensionless prefixes
• All results are in SI (metric) units by default
• > 4*m * 8*m
>> 32.0*m2
• Results in other units using: (expression) ! (units desired):
• > 4*m * 8*m ! acre >> 0.00790734057749*(acre)
• > c ! (mile/week) >> 1.12663593737e+11*((mile/week))
• Dimensional errors are trapped: 3*kg + 4*m gives error
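As a rough sketch of how units can act as variables with dimensional errors trapped, one can carry a tuple of base-unit exponents alongside each value. The class `Quantity`, the helper `as_quantity`, and the restriction to three base units are assumptions made for brevity; the acre factor is the international acre, so trailing digits can differ from the value on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple = (0, 0, 0)  # exponents of (m, kg, s); illustrative subset

    def __mul__(self, other):
        other = as_quantity(other)
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)

    __rmul__ = __mul__  # lets plain numbers appear on the left: 4*m

    def __truediv__(self, other):
        other = as_quantity(other)
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)

    def __add__(self, other):
        other = as_quantity(other)
        if self.dims != other.dims:  # trap dimensional errors
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def to(self, unit):
        """Express this quantity as a multiple of `unit` (the '!' operator)."""
        if self.dims != unit.dims:
            raise TypeError(f"cannot convert {self.dims} to {unit.dims}")
        return self.value / unit.value


def as_quantity(x):
    """Wrap a plain number as a dimensionless Quantity."""
    return x if isinstance(x, Quantity) else Quantity(float(x))


# Units are ordinary variables defined in terms of the SI base units.
m = Quantity(1.0, (1, 0, 0))
kg = Quantity(1.0, (0, 1, 0))
s = Quantity(1.0, (0, 0, 1))
acre = 4046.8564224 * m * m  # international acre (assumed definition)

area = 4 * m * 8 * m
print(area.to(acre))  # the area expressed in acres

try:
    3 * kg + 4 * m
except TypeError as err:
    print("error:", err)
```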
Units – 4 - Past Results (history)
• Results are sequentially numbered for each user, 1,2,…
• Past results cannot be erased
• A user can use a past result as hist(18) or [my_18] as:
• > 2* [my_18]
• Any past input expression can be retrieved as [my_18]
• The history file for a user can be retrieved with: history number offset
• history 153 20 begins with result 20 and provides 153 results
• If the offset is missing, then 10 is the default
• Past results of each user are archived in a table as:
• username, seq#, datetime, input string, result, uid, subject
• The optional subject is set with {!subject = any name}
Units 5 - Optional Base Units
• Six new units are optionally available:
• bit, b; a binary bit of information: 1/0 or T/F
• person, p; a unit of a living human
• dollar, usd, d; a unit of the US dollar
• flop, op; a unit of one floating point operation
• bn; baryon number (conserved like charge)
• ln; lepton number (conserved like charge)
• These allow a vast extension of the SI system in the economic and social sciences:
• Population density: 45.2*p/hectare
• Income: 34.7e3*d/(p*year)
• Information flow rate: 4.2*giga*bit*s_1
• Processing speed: 3.8*tera*flops
• Where flops = flop/s
Metadata - 1
• Any MN can be multiplied by {var1=val1 | var2=val2|…}:
• 6.4533e-4*kg*m_3*{lon=44*deg|lat=81*deg|depth=34.3*m}
• The expression {…anything at all….} always evaluates to “1”
• Thus {…} is a container for information attached to a MN
• {.} or {.|reason = div by zero} to indicate a missing value
• Input expressions containing missing values result in a missing value
• {?} or {?|possible read error for x1} indicates a questionable value
• Computation proceeds normally but {?} is attached to the result
• Commands can be given in the form:
• {!subject = value} to set a subject for a user
• {!subject = } to remove a subject
• {!subject = ?} to determine the current subject
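A minimal sketch of reading such a {…} block follows. The function name `parse_metadata` and the returned dictionary layout are hypothetical; only the `{key=value|key=value}` shape and the `{.}` and `{?}` markers come from the slides, and command blocks such as {!subject = value} are not handled.

```python
def parse_metadata(block: str) -> dict:
    """Split a '{k1=v1|k2=v2}' block into a flag and a dict of fields.

    A leading '.' marks a missing value and a leading '?' a questionable one.
    Values are kept as unevaluated strings.
    """
    text = block.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("metadata block must be wrapped in braces")
    parts = [p.strip() for p in text[1:-1].split("|")]

    flag = None
    if parts and parts[0] in (".", "?"):
        flag = "missing" if parts[0] == "." else "questionable"
        parts = parts[1:]

    fields = {}
    for part in parts:
        if not part:
            continue
        key, sep, value = part.partition("=")
        fields[key.strip()] = value.strip() if sep else ""
    return {"flag": flag, "fields": fields}


print(parse_metadata("{lon=44*deg|lat=81*deg|depth=34.3*m}"))
print(parse_metadata("{.|reason = div by zero}"))
```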
MetaNumber Archived Values
• MetaNumbers are stored in tables on the default server:
• The table dimension can be 0, 1, 2, …
• The table, row, and column have unique names for MN values
• A value is accessed as :
• [table name_ row name _ column name…]
• This gives a path to indicate the exact location of the metanumber
• Thus each archived MN has a unique name
• For example: [e_gold_thermal conductivity] will be replaced by the thermal conductivity of gold from the elements table (e)
• MetaNumbers stored at any other location are indicated by:
• [internet path_dir__table name_ row name _ column name…]
• This allows the unique path name to uniquely name every MN
• This expression [ …] can be used as a variable in any expression
• Thus [e_lead_density]/[e_gold_density] gives the density ratio 0.58822+/-0.00006
Documentation
• An input of:
• # AnyString, is not processed as an expression
• It is saved as a comment string as the result for that seq#
• This allows one to document one's work and explain the process
• expression ! units desired gives the result in units desired
• >16*ft + 38*inch - 1.2*m ! ft
>> 15.23+/-0.33*(ft)
• Metric prefixes can be used in any order
• > 456* billion*kilo* dozen*giga* million >> 5.472e+30
• A farmer has 382.6 acres and needed 1.8 additional inches of rain. How many gallons would this be?
• > 382.6*acre*1.8*inch ! gallon >> (1.87+/-0.10)e+07*(gallon)
• Here one simply multiplies the area times the height to get the volume. Units are automatically managed.
• What is the gravitational attraction between a 222.3 lb man and an 85.1 kg golf cart 6 ft away?
• Use F = G*m1*m2/d**2 to get the force in newtons.
• Note that the constant G is given by “g”
• > g*222.3*lb*85.1*kg/((6*ft)**2) >> (1.7124+/-0.0022)e-07*m*kg*s_2
• A sports car accelerates from 0 to 60 mile/hour in 2.1 seconds. How many “g’s” will this give (ag is the acceleration due to gravity)?
• > ((60*mile/hour)/(2.1*s)) ! ag >> 1.30+/-0.06*(ag)
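The farmer example can be checked by hand. The sketch below assumes independent errors combined in quadrature (standard first-order propagation; the slides do not state which rule the tool applies), one unit of uncertainty in the last digit of each input, and the international acre and US gallon as conversion factors. The helper name `product_with_uncertainty` is hypothetical.

```python
import math

# Exact conversion factors to SI (assumed definitions).
ACRE_M2 = 4046.8564224       # international acre in m^2
INCH_M = 0.0254              # inch in m
GALLON_M3 = 0.003785411784   # US gallon in m^3


def product_with_uncertainty(factors):
    """Multiply (value, sigma) pairs; relative errors add in quadrature."""
    value = 1.0
    relative_sq = 0.0
    for v, sigma in factors:
        value *= v
        relative_sq += (sigma / v) ** 2
    return value, abs(value) * math.sqrt(relative_sq)


# 382.6 acre and 1.8 inch, each uncertain by one unit in the last digit.
volume_m3, sigma_m3 = product_with_uncertainty([
    (382.6 * ACRE_M2, 0.1 * ACRE_M2),
    (1.8 * INCH_M, 0.1 * INCH_M),
])
gallons = volume_m3 / GALLON_M3
sigma_gallons = sigma_m3 / GALLON_M3
print(f"{gallons:.4g} +/- {sigma_gallons:.2g} gallon")
```

Almost all of the uncertainty comes from the two-digit rainfall figure, which is why the result carries only about three significant digits.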
Working with MN Tables Suggests Something:
1. Think of MN tables as entities vs properties (elements)
   1. Some entities are more like others because their properties are similar.
2. Can we create an entity network from the table?
   1. Create an “entity Cij” that reflects this closeness
3. Then use our network clustering algorithm
   1. Seek clustering among the entities, even among properties
4. Let me describe this procedure
   1. First a few preliminaries…
Math Preliminaries
Vector Spaces, Lie Groups and Algebras, Markov Transformations
• Linear Vector Space (LVS): $|A\rangle + |B\rangle = |C\rangle$ and $a|A\rangle = |B\rangle$
• A basis $|i\rangle$ for the space gives all elements: $|A\rangle = \sum_i a_i |i\rangle$
• LVS becomes more powerful with the following two products:
• Metric Space (MS): $A \cdot B = \sum_i A_i B_i = |A|\,|B| \cos\theta$, a real number
• So we get the metrics of length and angle.
• Examples: regular space and the unitary scalar product
Lie Algebras & Lie Groups
• Lie Algebra (LA): $[L_i, L_j] = c_{ij}^{k} L_k$ & the Jacobi Identity
• Lie Group: $G(t) = e^{tL} = 1 + tL + t^2L^2/2! + \dots$
• LA Examples:
• Rotation (preserves $\sum_i x_i^2$),
• Translation $x_i' = x_i + a_i$,
• Lorentz (preserves $c^2t^2 - r^2$),
• Poincare (Lorentz with translations),
• Unitary (preserves $\sum_a \Psi_a^* \Psi_a = 1$),
• General Linear $x_i' = G_{ij} x_j$,
• Markov $\sum_i x_i' = \sum_i x_i$,
• Scaling $x_i' = e^{\lambda_i} x_i$ &
• Heisenberg X, P, I.
• LA Representations: LA represented by matrices acting upon vectors in a metric space.
Information & Entropy
• Information is defined by probability:
• If the probability is large at a place then we know where something is and thus have more information.
• Information (order) is the inverse of Entropy (disorder)
• Information is additive but probabilities are multiplicative.
• Thus I is a log of P
• Shannon: $I = \sum_i P_i \log_2 P_i$
• Renyi: $I = \log_2\left(2 \sum_i P_i^2\right)$
• 1 / 0 state: $I = \log_2 2(1^2 + 0^2) = \log_2 2 = 1$
• ½ / ½ state: $I = \log_2 2\left((1/2)^2 + (1/2)^2\right) = \log_2 1 = 0$
• or generally $S(a) = \frac{1}{1-a} \log_2\left(\sum_i P_i^a\right)$, which gives the Renyi entropy of order “a”
• Information is measured in 1 & 0 “bits”.
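The two worked cases on this slide can be reproduced numerically. In the sketch below, `renyi_entropy` follows the general order-a formula as given, while `second_order_information` generalizes the factor 2 to the number of states N; that generalization and both function names are assumptions, since the slide only states the two-state case.

```python
import math


def renyi_entropy(probabilities, order):
    """Renyi entropy of order a: (1/(1-a)) * log2(sum_i P_i**a)."""
    if order == 1:
        raise ValueError("order 1 is the Shannon limit; not handled here")
    total = sum(p ** order for p in probabilities)
    return math.log2(total) / (1 - order)


def second_order_information(probabilities):
    """I = log2(N * sum_i P_i**2); N = 2 reproduces the slide's formula."""
    n = len(probabilities)
    return math.log2(n * sum(p * p for p in probabilities))


print(second_order_information([1.0, 0.0]))  # the fully known 1 / 0 state
print(second_order_information([0.5, 0.5]))  # the fully uncertain state
```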
Markov Transformations
• Markov Transformations have, for 100 years, described
diffusion e.g. a random walk.
• Markov transformations preserve the sum of the components
of a vector
• The vector normally gives the probabilities for the components.
• Diffusion & entropy are described by Markov Transformations
• $x' = Mx$ with $\sum_i x'_i = \sum_i x_i$
• These are motions over a plane perpendicular to the vector (1, 1, 1, …, 1) and in the positive quadrant.
• This follows from $\langle 1|x\rangle = 1$ (is invariant)
Abelian Scaling Algebra A(n)
Decomposition of the General Linear Group
• GL(n,R) contains all the previous Lie algebras (LA) & groups:
• We discovered a new decomposition into:
• a Markov Type (MT(n²-n)) LA, together with the Abelian scaling algebra A(n)
• A general element in the Markov Type Lie Algebra (shown for n = 2) is $L = a_{12}L_{12} + a_{21}L_{21}$
More Specifically:
• Markov Type LA constrained to Markov Transformations gives
the Markov Monoid MM
• This connects Lie Algebras and Lie Groups with Diffusion, Random
walks, Entropy and Information
• Here we see the identity for λ = 0 and equilibrium with total diffusion at λ = infinity
The Markov Monoid
• In order to preserve the positive definiteness of the new components, it is necessary and sufficient that the λ parameters are all non-negative.
• This removes the inverse of the Markov type transformation and
leaves us with a Lie Monoid.
• Note that the diagonal elements are necessarily the exact
negative of the sum of the off-diagonal elements in each
column.
• The L basis are also called Lagrange matrices.
• Thus we have linked Lie algebras and Lie groups with
continuous Markov transformations
• This allows the power of each domain of mathematics to inform
the other.
Networks
Networks are Isomorphic
to Markov Monoids
• A network is a set of points called nodes numbered 1,2,..,n
• which are connected with some ‘strength’ Cij
• which is an n x n matrix of non-negative numbers.
• However the diagonal is missing, not zero – missing!
• There is no meaning for the connection of a thing to itself.
• Thus a network C exactly defines the Lie monoid generator L
• through the off-diagonal elements, and conversely,
• if one defines the diagonal elements of C as the negative sum of that column.
• Thus every network C is isomorphic to a Markov monoid generator L
• and to a family of Markov transformations, M(a) = exp(a L).
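The construction above can be sketched with NumPy. The function names `markov_generator` and `markov_matrix` are hypothetical, the small connection matrix is made up for illustration, and the exponential is approximated by a truncated Taylor series rather than a library routine.

```python
import numpy as np


def markov_generator(C):
    """Build L from a network C: off-diagonals from C, each diagonal
    element set to the negative sum of its column."""
    L = np.array(C, dtype=float)          # copy so the input is untouched
    np.fill_diagonal(L, 0.0)              # the network diagonal carries no meaning
    np.fill_diagonal(L, -L.sum(axis=0))   # columns of L now sum to zero
    return L


def markov_matrix(L, t, terms=30):
    """Approximate M = exp(t L) by summing the first `terms` series terms."""
    n = L.shape[0]
    M = np.eye(n)
    term = np.eye(n)
    for k in range(1, terms):
        term = term @ (t * L) / k
        M += term
    return M


# Illustrative 3-node network: nodes 0 and 1 strongly tied, node 2 weakly.
C = [[0.0, 1.0, 0.1],
     [1.0, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
L = markov_generator(C)
M = markov_matrix(L, t=0.5)
print(M.sum(axis=0))  # each column should sum to one
```

Because every column of L sums to zero, every partial sum of the series already has unit column sums; non-negativity of M additionally relies on the non-negative off-diagonal elements and on taking enough terms.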
A Static Network Generates Markov
Flows and Provides a Model
• Now given any C we can construct the Markov Monoid L and
generate the continuous Markov transformation M = exp(t L)
• This generates a series of flows among the nodes in proportion to
their network connectivity.
• The Markov transformation has columns which are nonnegative and sum to unity.
• Thus they support a definition of Renyi entropy for each node.
• The entropy values can be sorted to give order to the nodes
• Representing the topology uniquely by an entropy spectral curve.
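As a sketch of the entropy spectral curve, the function below computes the order-a Renyi entropy of every column of a column-stochastic matrix and sorts the values. The function names, the descending sort direction, and the small example matrix are assumptions for illustration only.

```python
import numpy as np


def column_renyi_entropy(M, order=2):
    """Renyi entropy of order a for each column of a column-stochastic M:
    S_j = (1/(1-a)) * log2(sum_i M_ij**a)."""
    M = np.asarray(M, dtype=float)
    return np.log2((M ** order).sum(axis=0)) / (1 - order)


def entropy_spectrum(M, order=2):
    """Node entropies sorted from largest to smallest (assumed ordering)."""
    return np.sort(column_renyi_entropy(M, order))[::-1]


# Hypothetical Markov matrix: non-negative entries, columns sum to one.
M = np.array([[0.8, 0.1, 0.25],
              [0.1, 0.8, 0.25],
              [0.1, 0.1, 0.50]])
print(entropy_spectrum(M))
```

Sorting discards the node labels, so two networks that differ only by a renumbering of their nodes give the same curve.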
Our Previous Results
• The isomorphism of every network to a Lie Markov monoid
now integrates three branches of mathematics:
• Lie algebras and groups --- Markov theory --- Networks
• We were able to show that the Renyi entropy spectral curve
revealed aberrant changes in a network topology thus
showing changes over time (attacks, failures) and allowing a
comparison of two different networks with a metric for the
distance between them (scalar product…)
Expansion of the Topology
as a Series
1. We seek to expand a network as a series of terms
of decreasing importance.
2. We realized that each of the Renyi entropies was
a higher power of the Markov column
components
1. These are all functionally independent.
3. Furthermore each entropy spectral curve was
smaller than the preceding entropy curve.
Differences of Successive Orders of
Renyi Entropies
• Each spectral curve is less than the previous,
• Thus the differences between two successive orders become smaller very rapidly.
• Thus this sequence of curves constitutes a rapidly decreasing
series of curves that contain all of the topological information
of the network.
• It serves as the expansion that we sought: complete and rapidly
decreasing.
• This expansion into differences between successive Renyi
entropy sorted spectral curves provides a complete series
expansion for the topology.
• The same process can be executed on the rows.
• There are n*(n-1) independent values in the network topology
• Since the columns and rows together provide 2n values for each order, this requires (n-1)/2 orders of Renyi entropies.
• These curves can be used to monitor anomalous changes in a
network:
• Attacks, failures, and the differences between any two networks.
• It can also track successive differences over time in a single
network and thus track its time evolution
Eigenvectors Contain the Cluster
Information
• Consider the eigenvector / eigenvalue solutions for the
Markov transformation
• The Markov transformation gives a model of the conserved flows.
• We realized that flows were faster within clusters in approach
to equilibrium but slower to the remaining nodes which are
weakly joined.
• We also realized that the eigenvectors described those
combinations of nodes that approached equilibrium at the
exact rate of the associated eigenvalue.
• Like normal modes of oscillation in physics.
• Thus the entire cluster analysis is displayed in the eigenvectors
• Each eigenvector is a linear combination of the component nodes
• Furthermore, the strength of the clustering is collectively
measured by the magnitude of the eigenvalue.
• The eigenvalue gives an order and name to the clusters
• The exp(tL) expansion is Markovian for any number of terms
• The number of terms gives the degree of separation.
• The eigenvector structure (and thus the cluster information) is slightly different for the different numbers of terms used for the exp(tL) expansion.
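A small numerical illustration of clusters appearing in the eigenvectors: for a symmetric network, M = exp(tL) has the same eigenvectors as L, with eigenvalues exp(tλ), so the generator can be diagonalized directly. The sketch below is restricted to symmetric C, uses a made-up four-node network, and only splits on the sign of the slowest non-trivial mode; it is not the authors' cluster identification algorithm, and the name `slowest_mode_split` is hypothetical.

```python
import numpy as np


def slowest_mode_split(C):
    """Split nodes by the sign of the slowest-decaying non-trivial eigenvector.

    Assumes C is symmetric with a zero diagonal.
    """
    C = np.asarray(C, dtype=float)
    L = C - np.diag(C.sum(axis=0))        # diagonal = negative column sum
    rates, modes = np.linalg.eigh(L)      # eigenvalues in ascending order
    slow = modes[:, -2]                   # last is the equilibrium mode (rate 0)
    positive = [i for i, x in enumerate(slow) if x > 0]
    negative = [i for i, x in enumerate(slow) if x <= 0]
    return rates, sorted([positive, negative])


# Two tight pairs, {0, 1} and {2, 3}, joined by weak links (illustrative values).
C = [[0.00, 1.00, 0.01, 0.01],
     [1.00, 0.00, 0.01, 0.01],
     [0.01, 0.01, 0.00, 1.00],
     [0.01, 0.01, 1.00, 0.00]]
rates, clusters = slowest_mode_split(C)
markov_eigenvalues = np.exp(1.0 * rates)  # eigenvalues of M = exp(tL) at t = 1
print(clusters)
print(markov_eigenvalues)
```

Flow within each pair equilibrates quickly (large decay rate), while flow between the pairs is slow, so the eigenvalue closest to one below the equilibrium mode is the one that separates the two clusters.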
Network Classification – a beginning
• Thus the entire network topology is captured in both methods
of network analysis:
1. The (n-1)/2 successively smaller nested Renyi entropy spectral curves contain all of the topology in increasingly less important terms.
   1. These measure the nth order entropy of incoming and outgoing connections for each node.
2. The n eigenvectors of the Markov matrix, each with n components, sorted by the order of the eigenvalues as a metric of the degree of clustering, likewise provide a complete description of the topology.
   1. These measure the clustering structure in the network.
Applications to Simulated
Networks
• We created a network with clusters within clusters by using
larger values of the connection matrix in successive layers and
scrambled the nodes.
• We computed the eigenvalues and eigenvectors, which immediately revealed the anticipated cluster structure.
• As there is no formal definition of clustering, there is no way
to prove this hypothesis but our observations confirm what
one intuitively can see by the solution as well as by the
underlying Markov model.
A network that has a cluster embedded within
a cluster gives these eigenvalues:
• Eigenvalue: 0.882491186839: Included nodes: [0, 1, 2]
• Eigenvalue: 0.882491186839: Included nodes: [1, 2]
• Eigenvalue: 0.926438963526: Included nodes: [0, 1, 2, 3, 4]
• Eigenvalue: 0.926556991774: Included nodes: [0, 1, 2, 4]
• Eigenvalue: 0.955737589687: Included nodes: [5, 6, 7]
• Eigenvalue: 0.955934195065: Included nodes: [5, 7]
• Eigenvalue: 0.970549538208: Included nodes: [8, 9]
• Eigenvalue: 0.999846933025: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
• Eigenvalue: 0.999953415037: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
• Eigenvalue: 1.0: Included nodes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Show C and its Eigenvectors:
10-Node Cluster Network: 3 clusters
Testing the procedure
• A threshold value is set to make it easier to see the strongest
connections
• Every element below a certain percentage of the max value is set
to 0
• This example uses a table of 103 elements with 21 physical
properties
• The strongest 3% of the connections are shown (from
strongest to weakest: blue, green, and red)
Generate Networks from MetaNumber
Tables and Seek Clusters
Clustering of Tabular Data for Entities
versus their Properties
• Consider a standardized MetaNumber table
• Rows are entities, like elements; columns numerically measure their properties
their properties
• Some entities are “closer to other entities” in the sense that
their spectra of properties are more similar.
Calculating the elements of Cij
• The mean and standard deviation of each property (such as density) are computed; all values of a given property have the same units.
• Each property value is rewritten as the number of standard deviations from the mean, thus giving a table $T_{ij}$
• One then defines a network C as
• $C_{jk} = \exp\left(-\sum_i (T_{ij} - T_{ik})^2\right)$
• One then computes the eigenvalues and eigenvectors of
• M = exp(tL), where L is the Markov generator built from C, to identify clusters
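The normalization and the network definition above can be sketched as follows. The sketch stores entities as rows and properties as columns, so the sum over i in the formula becomes a sum over columns; the function name `table_to_network` and the table values are made up purely for illustration.

```python
import numpy as np


def table_to_network(table):
    """Turn an entities-by-properties table into an entity network C.

    Each property column is converted to standard deviations from its mean,
    then C_jk = exp(-sum over properties of (T_j - T_k)**2).
    """
    X = np.asarray(table, dtype=float)            # rows: entities, cols: properties
    T = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score each property
    diff = T[:, None, :] - T[None, :, :]          # pairwise entity differences
    C = np.exp(-(diff ** 2).sum(axis=2))
    np.fill_diagonal(C, 0.0)                      # self-connection has no meaning
    return C


# Hypothetical table: 4 entities, 2 properties (values are not real data).
table = [[1.0, 10.0],
         [1.1, 11.0],
         [5.0, 50.0],
         [5.2, 49.0]]
C = table_to_network(table)
print(np.round(C, 3))
```

Dividing by the standard deviation removes the units of each property, which is what allows differences in unlike properties to be added inside the exponent.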
Current Work
Create and Analyze the “Numerical Universe”
1. Build algorithm to standardize a CSV table automatically
2. Automatically scan web sites for numerical tables and normalize them as metanumber standardized tables.
   1. Identify the unique defining indices for rows and columns.
   2. Identify the units for each value and attach it to the value
   3. Execute the table normalization with means and std dev.
3. Compute the Renyi entropy spectral decompositions.
4. Perform Eigenvector/Eigenvalue analysis to identify clusters
5. Build a dashboard to monitor results
6. Archive the resulting tables for use with MN computations
7. Link clusters to form the ‘supernet’
Development Plan
1. Use MN with basic tables to do calculations
2. Explore Web Grid as a user tool with other APIs
3. Applications to instruction (AAPT presentation Oct 31, 2015)
4. Algorithm to take well formed tables into MN tables
   1. Begin with well formed tables such as financial data tab
   2. Use with E-books, and Reference Materials
   3. Explore with Companies with their private data sources
   4. Explore with Government agencies for limited internal work
5. Algorithm to Scrape Web to Create MN Tables
   1. Push as a means to create MN on a large scale
6. Explore Supernet as a research tool
   1. Along with cluster structures from tables
Our Appreciation to:
• Office of USC VP of Research
• for the ASPIRE I & II supporting grants
Faculty CoPIs – Phase I:
• Dr. Don Jordan, Mathematics
• Dr. William Hogue, USC CIO & VP
• Dr. Phil Moore, Director RCI
• Dr. Bob Mullen, Chair, Civil Engineering
• Dr. Paul Huray, Electrical Engineering
• Dr. Camelia Knapp, Geological Sciences and ESRI Director
• Dr. Tammi Richardson, Biology
• Dr. Dwayne Porter, Chair Public Health
• Dr. Geoff Scott, New Chair, Public Health
• Dr. Kendra Albright, Library and Information Science
• Dr. Susan Rathbun-Grubb, Library and Information Science
• Dr. Bert Ely, Biology & Director Center for Science Education
Faculty CoPIs - Phase II:
• Dr. William Hogue, USC VP and CIO
• Dr. Phil Moore, Director Research CyberInfrastructure (RCI)
• Dr. John Rose, Computer Science & Engineering
• Dr. Gabriel Terejanu, Computer Science & Engineering
• Dr. Francisco Blanco-Silva, Mathematics
• Dr. Kendra Albright, Library and Information Science
Student Associates
• Phase I
• Christian Merchant
• Jordan Bradshaw
• Brian Flick
• Ben Torkian
• Jun Zhou
• William Hannah
• Ethan Anderson
• Ryan Giannelli
• Dan Ramage
• Lisa Wicklifte
• Ideen Ghorbani
• Phase II
• Sara Chizari
• Roy Myers
• Samuel Watson
• Nick Puig
• Dylan Shields
• Sourav Das
Thank you for your interest.
• www.metanumber.com
• www.asg.sc.edu
• Joseph E. Johnson, PhD
• Distinguished Professor Emeritus
• Department of Physics and Astronomy
• Office: Room 405, Physical Science Center
• Phone: 803-777-6431
• University of South Carolina
• Columbia, SC, 29208, USC
Web application powered by University of South Carolina’s RCI Group
Directed by Dr. Phil Moore