A Proposed Numerical Data Standard With Automated Analytics Support

Joseph E. Johnson, PhD
Department of Physics, University of South Carolina,
Columbia SC, 29208, USA
jjohnson@sc.edu
Abstract— (A) A standard is proposed for all numeric data
that tightly integrates (1) each numerical value with (2) its units,
(3) accuracy (uncertainty) level, and (4) defining metadata into a
new object called a metanumber with full mathematical
processing among such metanumber objects. In essentially all
current data tables, these four components are distributed and
often embedded in freeform text, in the title, references, row and
column headings, and multiple non-standard locations within
other text in the document. While usually readable by humans,
this lack of structure frustrates the computer reading of
electronic representations of data thus incurring increased costs,
delays, and errors with severe consequences for Big Data. We
propose this metanumber standard for published numeric data
in order to lay a foundation for the fully automated reading (esp.
Big Data) by intelligent agents with a framework that satisfies
twelve criteria. Our standard, meeting these requirements, has
been designed, programmed, and is operational on a server in a
Python environment as a multiuser cloud application using any
internet linked device accessing standardized metanumber
tables. (B) Our current work is now to automatically create such
standardized metanumber data tables from existing web pages
using a data normalization algorithm which acts on web sites to
build the associated metanumber table assisted by an evolving
user interface when necessary. (C) Finally, as an example of how
intelligent algorithms can extract information from such fully
standardized numerical information tables, we have applied the
advances we have made in the mathematical foundations and
analysis of networks including an associated agnostic cluster
analysis. This framework utilizes our proof that every network,
as defined by a connectivity (connection) matrix Cij, is isomorphic
to the Lie algebra (monoid) generators of all continuous (as well
as discrete) Markov transformations, thereby linking the
extensive mathematical domains of Lie algebras and groups first
with Markov transformations, and now to network analysis and
then finally to our associated cluster analysis. We will describe
the current results of our latest work which builds two networks
from entity-property type standardized metanumber tables. One
network is built among the entities (rows) while the other is built
among the numerical properties (columns) of those entities. Thus
as metanumber standard tables are created, our algorithms
automatically create these two associated networks and then
compute and display the nodal clustering first among entities and
then separately among properties. We seek to fully automate this
entire process from web sites to metanumber standardized tables
to the associated networks to the resulting cluster analysis with
an automated intelligent analysis for data sets esp. Big Data.
Keywords—Units, Metadata, Uncertainty, Standards, Clusters.
I. INTRODUCTION
If one examines electronic data from most sources1,2,3, it
becomes obvious that the units and the exact definitions of the
values are hidden in text in multiple locations and with
multiple formats. While such data is normally easily understood
by human intelligence, computers can neither reliably infer the
exact units for each value nor extract the exact meaning of the
value in an unstructured environment. Furthermore, assuming that the
accuracies of the values are reflected in the number of
significant digits, then when values are processed by a
computer, the results are values with essentially as many digits
as the computer retains in single or double precision. Thus the
number of significant digits is lost unless they are captured
immediately upon reading and then retained correctly
throughout the execution of all mathematical operations.
Otherwise the accuracy of the results is obliterated with the
first operation. Big Data problems refer both to those data sets
of vast size which require extensive processing speed, storage,
and transmission rates, and often more so to the massive
problems incurred with the astronomical number of existing
tables when each is formatted and structured in diverse ways.
This demands human preprocessing to convert each table to the
user's framework for subsequent computation, thus frustrating
automation. In conclusion, the examination of most published
tabular data suggests that they are designed for space
compression and easy reading by humans. As space is now
rarely a problem, and as human preprocessing becomes
unacceptable, we need to achieve simultaneous human and
computer readability. We call this standardized string object a
“metanumber” as it is actually a string, tightly linking all four
information components. These are to be managed through all
mathematical operations to give new “metanumbers”
containing the correct units, accuracy, and metadata tags. Yet
to achieve standardization requires far more than just
establishing some structure. It requires (a) that one find
intermediate workable solutions to convert existing data to the
standard framework until such time that this standard becomes
accepted for publishing data, and (b) that such a system is in
keeping with current standards in computer science as well as
the domain disciplines.4,5 We have developed an unambiguous
standard which is also very flexible in accommodating
existing standards in specialized subject domains. This
standard satisfies the following requirements. It is to (a) be
easily readable by both humans and computers, (b) be
consistent with international scientific units, the SI metric
foundation with automatic management of dimensional
analysis including other units used in specific subject domains,
(c) provide automatic computation and management of
numerical uncertainty, (d) support a defining metadata standard
for the unique representation of each number and its exact
meaning, (e) allow the linkage of unlimited metadata qualifiers
and tags without the burden of an associated data transfer or
processing, and (f) be sufficiently flexible to evolve and
encompass existing domain standards that are currently in use.
The structure must also support (g) computational
collaboration and sharing among user groups, (h) be
compatible with current standards in computer coding, and (i)
provide an algorithm to convert, as best possible, current
nonstandard representations to this standard, until such time as
data is published in this standard. This standard should also
support (j) the analysis of a number’s complete historical
computational evolution, (k) optional extensions to the SI base
units, and (l) be configured as a module or function that can be
included in other systems. In the second half of this paper, we
will also demonstrate ways in which intelligent agents can
operate automatically upon such standardized data, utilizing
our recent work on the mathematical foundations of networks
and associated cluster identification algorithms.
This
application utilizes a means of converting metanumber tables
of entities with their properties into a form that allows
conversion to a network thus supporting an automated
procedure for cluster identification for each new standardized
metanumber table. Our cluster identification algorithm
provides an example of how this metanumber standard can
support advanced intelligent analytics.
II. STANDARDS DEFINED
A. A Numerical Accuracy and Uncertainty Standard
While some numbers are exact such as the number of people
in a room or the value (by definition) of the speed of light,
almost all measured values are limited in accuracy as is
represented by the number of significant digits shown. In a
very few cases one might know the standard deviation of the
normal probability distribution but such knowledge is not
generally available. The subject of the correct management of
numerical accuracy (uncertainty) is very complex because one
is actually replacing a real number with a probability
distribution which is a function. This function needs multiple
other real numbers to specify it unless one is confident that it
is a normal probability distribution, but even then such
functions do not close into other simple functions under all
mathematical operations. While the author’s research into this
subject led to a new class of number 6,7 (similar to one based
upon quantum qbits), that method is so complex that we do
not wish to utilize it in this initial standard framework. The
programing of the retention of the correct representation of
accuracy is also a complex task. Thus we were fortunate that
an existing Python module (python uncertainty)8 has been
developed that can be joined to our software. Because the
number of significant digits is always available, we have
chosen the Python Uncertainty module for the management of
numerical accuracy where our python functions convert a
given real into a new class of number called a “ufloat” which
retains both the numerical value (represented with correct
accuracy) along with the associated uncertainty which arises
from the limited number of digits. Thus it only remained for
us to encode methods for when and how to convert a value to
a ufloat or to keep it exact.
Our standard is to represent any exact value with no
decimal point present. This is achieved by adjusting the
exponent in scientific notation to remove the decimal. If a
value has a decimal point present then it is to be treated as
uncertain having the number of significant digits shown. Thus
5.248e5 will be converted automatically to a ufloat, but that
same value, if exact, would be entered as 5248e2 and treated as
an exact real number as there is no decimal. A second option
that is offered utilizes the upper case “E” for scientific notation
for exact numbers with the lower case “e” for uncertain values.
Our Python algorithm automatically converts all keyed and
data table values to the correct form as just described while all
unit conversions and internal constants are treated as exact
values. A virtue of this system is that no other explicit notation
is needed to encode numerical accuracy as it is automatically
executed as described based upon the presence or absence of a
decimal. Furthermore, the Python Uncertainty module also
supports other forms of uncertainty which can be invoked in
future releases of our metanumber software.
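To make the convention concrete, the following minimal Python
sketch (not the production metanumber parser; the function name and
the half-unit-in-the-last-digit rule are illustrative assumptions)
converts a keyed token either to an exact real or to a ufloat from
the Python Uncertainty module:

    # Illustrative sketch only, assuming the decimal-point convention above.
    from uncertainties import ufloat

    def parse_value(token: str):
        """Exact real if no decimal point is present; otherwise a ufloat
        with half a unit of uncertainty in the last displayed digit."""
        mantissa, _, exp = token.lower().partition('e')
        if '.' not in mantissa:
            return float(token)                 # exact: 5248e2 -> 524800.0
        decimals = len(mantissa.split('.')[1])  # digits after the decimal
        scale = 10 ** int(exp) if exp else 1
        return ufloat(float(token), 0.5 * 10 ** -decimals * scale)

    print(parse_value('5248e2'))    # exact real 524800.0
    print(parse_value('5.248e5'))   # uncertain: 524800 +/- 50

Thus 5.248e5 acquires an uncertainty of ±50, half a unit in its last
significant digit, while 5248e2 remains an exact real number.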
B. Units & Dimensional Analysis Standard
Our standard format for attaching units to a numerical
value is to represent all units by variable names9 which are to
be attached so that the expression is a valid mathematical
expression such as the velocity 4.69e2*m/s. When data is
expressed in tabular form, one has a choice of compressing the
notation by writing the units for an entire table, or row or
column at the top and then requiring the subsequent
multiplication to attach the units to the value to be executed at
the time of use. While this is the more common convention
and requires less storage space, it also requires more
processing time when each value is retrieved. As space is
rarely now a problem, we have chosen to adjoin the units
directly to each value thus increasing storage, reducing
processing time, and improving human readability. Naturally,
when units are homogeneous over an entire table, row or
column, then an alternative notation allows the units to be
adjoined at the time of retrieval with the result that a
metanumber standardized table in this format requires no more
space than otherwise required but increases the processing time.
That method, offered as an option, labels a row or column as
“*units” which causes the units that follow in that row or
column (or even the table) at the time of retrieval to be
adjoined to the value.
The general rules for unit names are given as follows.
Units (1) are variable names which are parts of valid
mathematical expressions: 3*kg+2.4*lb. The conversions are
automatically executed by the metanumber software. (2) Are
single names such as “ouncetroy” not “troy ounce” since each
unit is a single variable name. (3) Are lower case with only
a…z, 0-9, and the underscore “_” character: e.g. mach, c, m3,
s_2, gallon, ampere, kelvin. This is in keeping with modern
software conventions which have variables in lower case and
where variable names are case sensitive. The convention of
lower case also removes the potential case ambiguity.
(4) Never have superscripts or subscripts or any information
contained in font changes: m3, not m³. Such fonts are
sometimes inconsistently treated in different software
languages and thus no information may be represented by a
change in font. (5) Never use Greek or other characters or
special symbols: ohm not Ω, hb not ħ, pi not π. (6) Are defined
with nested sequential definitions in terms of the basic SI units:
m, kg, s, a, k, cd. The Python code defines each unit in terms
of previous units back to the basic SI standard. (7) Use singular
spelling and not plural: foot not feet, mile not miles. This is to
avoid spelling mistakes and to reduce the code size. (8)
Includes all dimensionless prefixes as separate words, to be
compounded in any order as needed: micro*thousand*nano*
dozen*ft. Note that the “*” or “/” must be present between
each unit or prefix. (9) Evaluation of expressions always
results in SI (metric) units as the default. To obtain the result
in any other units one uses the form (expression) ! (units
desired) e.g. c ! (mile/week). This form can be used to obtain a
result in any desired units. (10) Dimensional errors are
trapped: 3*kg + 4*m gives an error. (11) All unit conversion
values and internal constants are treated as exact numerical
values without uncertainty. (12) A user can use a past result
with the form: [my_423] where 423 is the sequence number of
a previous result. (13) Four new units are optionally available:
bit, or b for a bit of information (T/F, 1/0); person, or p for a
living human; dollar, usd, d for the US dollar; and flop (for a
single floating point operation). These units vastly extend the
dimensional analysis and clarity of the meaning of expressions
by allowing both information based concepts such as flops =
flop/s and baud = bit/s, as well as socioeconomic units of
p/acre (population density), and d/p (income as dollars per
person). (14) Users can define units in a jointly shared table
with [_unitname] which can be used like a unit or variable. It is
to be created by the command {! unitname = definition}. A
user can find the list of internal units on the web site
(www.metanumber.com) and one should familiarize oneself
with the spelling and available internal units especially those
with short names that are abbreviations. With the metanumber
system one can calculate the gravitational force between two
masses easily even when the values are expressed in diverse
units such as between 15 lb and 12.33 kg that are 18.8 inches
apart: g*15*lb*12.33*kg/(18.8*inch)**2, using F = G m1 m2/(r12)².
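The following short Python sketch illustrates how treating units as
multiplicative variables yields automatic dimensional analysis for
the calculation above; the small class is our own illustration, not
the metanumber implementation, though the conversion factors shown
are the standard exact values:

    # Illustrative sketch: units as variables with SI dimension exponents.
    class Q:
        """A value with SI dimension exponents (m, kg, s)."""
        def __init__(self, val, dim):
            self.val, self.dim = val, dim
        def __mul__(self, o):
            return Q(self.val * o.val, tuple(a + b for a, b in zip(self.dim, o.dim)))
        def __truediv__(self, o):
            return Q(self.val / o.val, tuple(a - b for a, b in zip(self.dim, o.dim)))
        def __add__(self, o):
            if self.dim != o.dim:                 # 3*kg + 4*m is trapped
                raise ValueError('dimensional error')
            return Q(self.val + o.val, self.dim)
        def __rmul__(self, x):
            return Q(x * self.val, self.dim)
        def __pow__(self, n):
            return Q(self.val ** n, tuple(n * a for a in self.dim))
        def __repr__(self):
            return f'{self.val:.4g} (m,kg,s)={self.dim}'

    kg   = Q(1.0,        (0, 1, 0))
    lb   = Q(0.45359237, (0, 1, 0))   # exact conversion per the standard
    inch = Q(0.0254,     (1, 0, 0))   # exact conversion per the standard
    g    = Q(6.674e-11,  (3, -1, -2)) # gravitational constant in SI

    F = g * (15 * lb) * (12.33 * kg) / (18.8 * inch) ** 2
    print(F)   # ~2.456e-08 (m,kg,s)=(1, 1, -2), i.e. newtons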
C. The Metadata Framework Standard.
Standardized metanumbers are to be electronically published
in tables of different dimensions: zero dimensions for a single
value, one dimension for a list of values such as masses, and
two dimensions for a two dimensional table or array often
representing entities in rows with the properties of those
entities in columns. For one dimensional data the value can be
retrieved with the form: [mass_proton] where “mass” is the
name of the table on the default directory on the default server
and where “proton” is the unique name of the row given in
column one. For a two dimensional table one might write
[e_iron_thermal conductivity] which would retrieve the
thermal conductivity of iron in the elements table abbreviated
as “e”. Thus the general multidimensional form for retrieving
standardized metanumbers is [file name _ row name _ column
name]. For comparison, these names (like database indices)
must be unique. The program removes all white space and
lowers the case for each index string prior to comparison. If
the table is not in the default directory and server then the
metanumber is retrieved as [internet path to server_directory
path __ file name _ row name _ column name]. It is of great
importance to realize that with this standard this
expression provides a unique name to every standardized
numerical value (metanumber) within a table with unique row
and column string indices. These expressions can be used as
variables in any mathematical expression such as 5.4 times the
ratio of copper density to zinc density as
5.4*[e_copper_density]/[e_zinc_density]. For books the file name
might be the ISBN number followed by the path to that table
or value in the electronic document. This design allows an
automated conversion of all metanumber tables into a
relational database structure or SAS data sets. Our standard
however is a simple flat file in comma separated values (CSV)
form.
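A minimal Python sketch of this retrieval convention (the function
name is hypothetical, the table is assumed stored as a CSV file
named by its abbreviation, and the metadata rows described below are
skipped rather than fully parsed):

    # Illustrative sketch of resolving [file name _ row name _ column name].
    import csv

    def metanumber(path: str) -> str:
        table, row_name, col_name = [
            p.replace(' ', '').lower() for p in path.strip('[]').split('_')
        ]
        with open(table + '.csv', newline='') as f:
            rows = list(csv.reader(f))
        # find the %% row holding the primary unique column indices
        start = next(i for i, r in enumerate(rows)
                     if r and r[0].startswith('%%'))
        header = [h.replace(' ', '').lower().lstrip('%') for h in rows[start]]
        j = header.index(col_name)
        for r in rows[start + 1:]:
            if r and r[0].replace(' ', '').lower() == row_name:
                return r[j]
        raise KeyError(path)

    # e.g. metanumber('[e_iron_thermal conductivity]') returns the cell string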
For a given table, there is metadata that applies to the
whole table such as values for the variables: table
abbreviation, full name, data source(s), name of table
creator, email of table creator, date created, security codes
(optional), units (for all entries), and remark. These variable
names occupy the first row beginning in column two, with
<MN> occupying the cell at row one, column one to
identify the table as a valid metanumber standardized table.
The values for these variables are to be in the corresponding
columns in row two. This provides the flexibility for
additional variables. The format of the table is to be in comma
separated values (CSV) which is easily created or viewed in
spreadsheet software. Thus commas are not allowed.
Subsequent rows (3, 4, …) that have a name in column one
preceded by a “%”, such as %remarks, indicate that the row
contains metadata applicable to the table as a whole and thus
the corresponding rows can contain anything.
The first row that begins with the symbol %% in
column one indicates that this row and subsequent values in
column one are to be the primary unique indices that can be
used to point to values. Metanumbers are those cells in the
corresponding matrix that do not have row or column index
names that begin with a “%” symbol as such names indicate
that the corresponding row or column contains only metadata
to be associated with that row or column. Row or column
names beginning with %% can be auxiliary unique index
names for the corresponding row or column such as using an
element symbol (with column labeled as %%symbol) or its
atomic number (%%atomic number) rather than its full name.
Supporting metadata can be located in rows or columns that
give web addresses for more extensive information, equations
for use, or video lectures or other information or metadata.
Thus in the element table, one could have a row of internet
links under the indices like density and thermal conductivity
that give additional information, equations, or properties. The
actual metanumber values are those which do not have a “%”
sign preceding the associated row and column index. As a
consequence of this design, the simple path to a metanumber
[table_row_column….] provides a universally unique name.
This form is used to retrieve the metanumber and use it in
expressions as a variable. This path name also denotes the
path that provides unlimited metadata associated with the
value indicated, without the transfer of that metadata. This
design allows unlimited metadata tags to be linked to
pharmaceuticals, accounting expenditures, medical procedures,
scientific references, equipment or transportation items, and
other conditions or environments such as reference temperatures.
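As a concrete illustration of this layout, a hypothetical fragment
of an elements table (the columns, remarks, and unit spellings shown
are invented for illustration, following the rules above) might
begin:

    <MN>, abbreviation, fullname, creator, remark
    , e, chemical elements, jjohnson@sc.edu, illustrative fragment
    %remarks, this row may hold any table-wide metadata
    %%name, %%symbol, density, thermalconductivity
    iron, fe, 7.874e0*gram/cm3, 80.4e0*watt/(m*kelvin)
    copper, cu, 8.96e0*gram/cm3, 4.01e2*watt/(m*kelvin)

Here [e_iron_density] resolves to 7.874e0*gram/cm3, an uncertain
value by the decimal-point convention.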
While the above framework supports metadata that
attaches to an entire table, row, or column of values, one often
must attach metadata to specific metanumbers such as
longitude, and latitude. This can be accomplished with the
form *{var1=val1 | var2=val2 |…} which can be multiplied
at the end of any metanumber to provide information specific
to that value such as *{lon=…|lat=…|time=…}. Of special
note is the missing value {.} or {.|reason = div by zero} and
the questionable value {?} or {?|reason = ….}. Any missing
value in an expression is to give a missing result when
evaluated. Any questionable value is to compute normally but
then {?} is attached to the result. When the expression
“{anything}” is encountered in a mathematical expression, it
is converted to ‘1’ thus 3.45*{anything} results in 3.45. Thus
the form {anything} is a container for attaching information in
the mathematics without altering the results.
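A minimal Python sketch of these tag rules (illustrative only, using
Python's eval as a stand-in for the system's expression parser):

    # Illustrative sketch: {anything} acts as '1'; {.} and {?} propagate.
    import re

    def evaluate(expr: str):
        flags = set()
        def repl(match):
            body = match.group(1)
            if body.startswith('.'):
                flags.add('missing')
            elif body.startswith('?'):
                flags.add('?')
            return '1'                    # a tag never alters the arithmetic
        clean = re.sub(r'\{([^}]*)\}', repl, expr)
        if 'missing' in flags:
            return None                   # any missing operand -> missing result
        value = eval(clean)               # demo shortcut, not a safe parser
        return (value, '{?}') if '?' in flags else value

    print(evaluate('3.45*{lon=-81.0|lat=34.0}'))   # -> 3.45
    print(evaluate('2*{?|reason=outlier}'))        # -> (2, '{?}')
    print(evaluate('5*{.|reason=div by zero}'))    # -> None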
Our metadata and metanumber standards are compliant
with NIST4 and USGS5 recommendations as well as the latest
new basis for the SI/metric standards that for the first time
will base all fundamental units upon fundamental constants.10
D. Other Components of the MetaNumber Standard.
There is a single table that contains all actions of all
users containing the following variables: userid_seq#, date
time, input string, evaluated result, unit id# of result, optional
subject. The userid_seq# is a unique index for each line in the
archive. A user can access their own past archived file of
inputs and evaluated results for a seq# value or for a given
active subject. Further details can be found by entering “help”
which will give the procedures for the latest software release
or in the web user guides.
It is often important to document one's work. This can
be accomplished by entering the # symbol as the first symbol
on an input line and anything else can follow this symbol. This
is like the comment line for Python code and one can insert
documentation text for later use that describes one's
calculations. When the system sees the “#” symbol in the
first position, then all processing is bypassed and the
subsequent string is stored as the result in {# text
information…}.
A series of calculations can be collected under a
‘subject’ name by entering the line “{!subject = text}” then
the “subject” variable for that user is set to the “text”. The
subject name is retained until “{!subject= null}” or “{!subject
= }” is entered or until a user logs out. This enables one to
identify a set of calculations and remarks under a subject
heading for later output as an individual user. {!subject = ?}
will output the current subject. A user can also set a subject
and identify other users who are allowed to see and retrieve
the entries of other team members while they have the same
subject set. The instructions for this are under “help”.
Another feature of the metanumber system is a
metanumber value which is itself a mathematical expression.
For example the speed of sound in air is dependent upon the
operating temperature “To” in Kelvin which can be expressed
as (331.5+(0.600)*(To-273.15*k))*m*s_1. The use of such
expressions, when applicable, greatly enhances and
compresses the required storage.
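For example, evaluating this stored expression at an operating
temperature of 293.15 kelvin (a worked check in plain Python; the
function name is ours):

    def speed_of_sound(T_kelvin: float) -> float:
        """(331.5+(0.600)*(To-273.15*k))*m*s_1 evaluated in SI (m/s)."""
        return 331.5 + 0.600 * (T_kelvin - 273.15)

    print(speed_of_sound(293.15))   # 20 C -> 343.5 m/s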
III. UTILITY OF THE METANUMBER STANDARD FOR
AUTOMATED AGENTS
A. Networks and Clustering.
As an example of how automated intelligent agents
can operate unassisted in the analysis of numerical
information, we will explore how we are developing such
agents to convert metanumber tables to networks11 and to then
find clusters within those networks. It is well known that
cluster analysis is one of the foundations of intelligent
reasoning. Our very language is built upon names for entities
that cluster in use, appearance or function. It is also the basis
for the classification of biological organisms and the names
for objects in our everyday life. Yet there is no generally
agreed upon definition of clustering from a mathematical or
algorithmic basis and there are over a hundred current
algorithms for identifying clusters. The following will give a
very brief overview of our research into the mathematical
foundations of networks and an associated agnostic algorithm
for cluster identification based upon an underlying Markov
model with a probability based proximity metric. We will
utilize our previous research with a means for converting two
dimensional metanumber standardized tables of entities (rows)
and their properties (columns) into a network for the
collection of entities and one for the properties. We will
provide the algorithm for finding clusters for both networks.
This algorithm has been built to run automatically on all
metanumber tables that are generated by the standardization
algorithm. We first give a brief overview of our past research.
Networks represent one of the most powerful means for
representing information as interrelationships (topologies)
among abstract objects called nodes (points) which are
identified by sequential integers 1, 2, …, n. A network is defined
as a square (sparse) matrix (with the diagonal missing) that
consists of non-negative real numbers, and which normally is
very large, where the strength of connection between node i
and node j is Cij. The diagonal is missing because there is no
meaning for the connection of a thing to itself. The off-
diagonal elements must be non-negative because there is no
meaning to a connection that is less than no connection at all.
The mathematical classification and analysis of the topologies
represented by such objects is one of the most challenging and
unsolved of all mathematical domains12. Cluster analysis13 on
these networks can uncover the nature of cohesive domains in
networks where nodes are more tightly connected. There is no
single agreed upon formal definition of a cluster.
B. Lie Algebras & Groups and Markov Monoids
This section will give a very brief overview of our
mathematical results that provide a new mathematical
foundation for network analysis and cluster identification.
These results build upon the author's previous work
in developing a new method of decomposing the
continuous general linear (Lie) group14 of (n x n)
transformations. That decomposition15 showed the general
linear Lie group to be the direct sum of two Lie groups: (1) a
Markov-type Lie group (with n²−n parameters) and an Abelian
scaling group (with n parameters). Each group is generated as
is standard, by exponentiation of the associated (Markov or
scaling) Lie algebra. The Markov type generating Lie algebra
consists of linear combinations of the basis matrices that have a
“1” in each of the (n²−n) off-diagonal positions with a “−1” in
the corresponding diagonal for that column (an example of a
Lagrange matrix16). When exponentiated, the resulting matrix
M(a) = exp(a L) conserves the sum of elements of a vector
upon which it acts. By comparison, the Lie algebra and group
that preserve the sum of the squares of a vector’s components
is the familiar rotation group in n dimensions. But in this form
the Markov type transformation can transform a vector with
positive components into one with some negative components
(which is not allowed for a true Markov matrix). However, if
one restricts the linear combinations to only non-negative
values then we proved that one obtains all discrete and
continuous Markov transformations of that size. This links all
of Markov matrix theory to the theory of continuous Lie
groups and provides a foundation for both discrete and
continuous Markov transformations. There are a number of
other details not covered here. In particular, the number of
terms in the expansion of the exponential form of the Markov
Lie algebra matrix gives the number of degrees of separation.
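A small numpy/scipy sketch of this construction (the example network
is invented) verifies the conservation property:

    # Build a Markov generator from a connection matrix and exponentiate.
    import numpy as np
    from scipy.linalg import expm

    C = np.array([[0., 2., 0.],
                  [2., 0., 1.],
                  [0., 1., 0.]])          # a tiny symmetric network C_ij

    L = C - np.diag(C.sum(axis=0))        # diagonal = -(column sum): Lagrange form
    M = expm(0.5 * L)                     # M(a) = exp(a L) with a = 0.5

    print(M.sum(axis=0))                  # each column sums to 1: M is Markov
    v = np.array([1.0, 2.0, 3.0])
    print(v.sum(), (M @ v).sum())         # the sum of components is conserved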
C. Networks, Markov Monoids, and Cluster Identification
Our next discovery17 was that every possible network (Cij )
corresponds to exactly one element of the Markov generating
Lie algebra (those with non-negative linear combinations) and
conversely, every such Markov Lie algebra generator
corresponds to exactly one network! Thus they are isomorphic
and one can now study all networks by studying the associated
Markov transformations, the generating Lie algebra, and
associated groups18,19. Our subsequent recent discoveries20,21
are that (a) all nodes in any network can be ordered by the
second order Renyi entropies of the associated columns (and
rows) in that Markov matrix thus representing the network by a
pair of Renyi entropy spectral curves and removing the
combinatorial problems that frustrate many areas of network
analytics. A second Renyi entropy spectral curve can also be
generated using the rows rather than the columns for diagonal
determination. Now one can both compare two networks (by
comparing their entropy spectral curves) as well as study the
change of a network’s topology over time. One can even
define the “distance” between two networks (as the
exponentiated negative of the sum of squares of differences of
the Renyi entropies); if there are changes in the number of
nodes, one uses a spline smoothing. We recently showed that
the entire topology of any network can be exactly represented
by the sequence of the necessary number of Renyi entropy
order differences. These converge very rapidly. This is similar
to the Fourier expansion of a function such as for a sound wave
where each order represents successively less important
information. Our next and equally important result was that an
agnostic (assumption free) identification of the n network
clusters is given by the eigenvectors for this Markov matrix.
This not only shows the degree of clustering of the nodes but
actually ranks the clusters using the corresponding eigenvalues
thus giving a name (the eigenvalue) to each cluster. A more
descriptive name can be created from the nodal names (indices)
associated with highest valued nodes (when sorted by
eigenvalue) and as given by the associated eigenvector. The
reasoning underlying the cluster identification is that the
Markov matrix generated by the altered connection matrix (Lie
algebra element) has eigenvalues that successively represent
the rate of approach to equilibrium for a conserved substance
that flows among the nodes in proportion to their degree of
connectivity. Thus network clusters result in a higher rate of
approach to equilibrium as measured by the associated
eigenvalue for flows among the nodes identified by the
eigenvector.
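Continuing the small example above, a sketch of the second-order
Renyi entropy ordering and the eigenvalue-ranked clusters (the
parameter a = 0.5 and the normalization are illustrative choices):

    import numpy as np
    from scipy.linalg import expm

    C = np.array([[0., 2., 0.],
                  [2., 0., 1.],
                  [0., 1., 0.]])
    L = C - np.diag(C.sum(axis=0))
    M = expm(0.5 * L)                      # columns are probability vectors

    # second-order Renyi entropy H2 = -log(sum_i p_i^2), one per column
    H2 = -np.log((M ** 2).sum(axis=0))
    print(np.sort(H2))                     # the entropy spectral curve

    vals, vecs = np.linalg.eig(M)          # M symmetric here, so all real
    rank = np.argsort(-vals.real)          # eigenvalue 1 is equilibrium;
    print(vals.real[rank])                 # the rest rank candidate clusters
    print(np.round(vecs.real[:, rank], 3)) # eigenvectors identify the nodes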
D. Metanumber Tables Define Networks and Clusters
Now let us recall the two dimensional metanumber
standardized storage of entities and properties of those
entities. For simplicity, let us just consider the table of
metanumber values with a single unique index label for the
row and another for the column. One example we have
explored is the table of the chemical elements (rows) versus
the 50 or so properties of those elements while another is the
nutrition table with 8,400 foods versus 56 numerical properties
of their nutritional contents. Another such table would be
personal medical records of persons versus their
approximately one hundred numerical metrics of blood, urine,
and other numerical parameters including multiple records for
each person identified by time. Another would be the table of
properties of pharmaceutical substances including all
inorganic and organic compounds, or even a corporation’s
manufactured items. Our most recent development is that one
can generate two different networks from such a table of
values Tij for entities (such as the chemical elements in rows)
with properties (such as density, boiling point) for each
element in a column (all values within a column share the same units). To do
this we first normalize the metanumber table by finding the
mean and standard deviation of each column and then rewrite
each value as the number of standard deviations away from
the mean, which is set numerically to zero. Since the
dimensionality for each property has the same units, then this
process also removes the units that are present and the new
values are dimensionless. For example the density or thermal
conductivity column will have values with all the same
dimensionality. We then define a network Cij among the n
entities (here the elements listed in rows) as the
exponentiation of the negative of the sum of the squares of the
differences between Tik and Tjk, thus Cij = exp(−Σk (Tik − Tjk)²).
This gives a maximum connectivity of ‘1’ if the values are all
the same and a connection near ‘0’ if they are far apart as would
be expected for the definition of a network. This is similar to
the expectation that a measured value actually differs from the
true value by that degree. We can form a similar network
among the properties for that table. One then, as before,
adjusts the diagonal to be the negative of the sum of all terms
in that column to give a new C, forms the Markov matrix M(a)
= exp (aC) and finds the eigenvectors and eigenvalues for M
to reveal the associated clustering. The rationale for how this
works can be understood when the Markov matrix is viewed
as representing the dynamical flow of a conserved
hypothetical substance among the nodes: the result of its
action on a vector of non-negative values leaves that vector's
sum invariant. This methodology includes and generalizes the
known methodology with a Lagrange matrix.
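A compact end-to-end sketch of this procedure for a toy four-element,
two-property table (real handbook values, but the tiny table and the
choice a = 1 are purely illustrative):

    import numpy as np
    from scipy.linalg import expm

    # rows: Fe, Cu, Ni, Na; columns: density (g/cm3), melting point (K)
    T = np.array([[7.874, 1811.],
                  [8.96,  1358.],
                  [8.908, 1728.],
                  [0.971,  371.]])

    Z = (T - T.mean(axis=0)) / T.std(axis=0)   # z-score each property column

    # entity network: C_ij = exp(-sum_k (Z_ik - Z_jk)^2), zero diagonal
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    C = np.exp(-D)
    np.fill_diagonal(C, 0.0)
    # (using Z.T instead of Z gives the companion network among properties)

    C = C - np.diag(C.sum(axis=0))             # Markov generator diagonal
    M = expm(1.0 * C)                          # M(a) = exp(a C)

    vals, vecs = np.linalg.eig(M)
    rank = np.argsort(-vals.real)              # clusters ranked by eigenvalue
    print(vals.real[rank])
    print(np.round(vecs.real[:, rank], 3))     # a slow mode separates Na
                                               # from the Fe/Cu/Ni cluster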
We are currently exploring the clusters and network
analysis that can be generated from tables of standardized
metanumber values. We have done this for the elements table
which was very revealing as it displayed many of the known
similarities among elements. The cluster of iron, cobalt and
nickel was very clear as well as other standard clusters of
elements such as the halogens and actinides. We are currently
analyzing a table on the properties of pesticides and we are
also performing this analysis on a table of 56 nutrients for
8600 natural and processed foods. The study of clustering in
foods based upon their nutrient and chemical properties as
well as the clustering among the properties themselves will be
reported as available on the www.metanumber.com web site.
Our initial results show three of the dominant food-nutrition
clusters to be oils for cooking, nuts, and baby foods. Results of
this analysis are expected by the end of September of 2015.
The numerical standardization in terms of
metanumber tables leads to other more holistic networks which
can be built utilizing the metanumber structures with resultant
cluster analysis. Users (using the PIN#) can each be linked to
(a) unit id (UID) hash values of the results of their
calculations, as well as to (b) the table, row, and column
names of each value. These linkages can be supplemented
with linkages of each user to those universal constants that
occur in the expressions which they evaluate. The resulting
network links users to (a) concepts such as thermal
conductivity, (b) substances such as silver alloys, and (c) core
constants such as the Boltzmann constant, Planck's constant, or
the neutron mass. The expansion of this network in different
powers, giving the different degrees of separation, can then
link users via their computational profiles (the user i x user j
component of C) as well as linkages among substances,
metadata tags, and constants. The clustering revealed in
different levels of such expansions then can reveal groups of
users with linkages that are connected by common
computational concepts. Users working on particular domains
of pharmaceuticals and methodologies are thus identified as
clusters as well as groups of astrophysicists that are utilizing
certain data and models. At the same time, the clustering can
identify links among specific substances, models, and
methodologies. Our current research is exploring such
networks and clusters as the underlying metanumber usage
expands.
IV. CONCLUSIONS
A standard for the integration of numerical values
along with their units, accuracy level, and exact unique path
name (termed a “metanumber”) has been proposed and
programmed in Python on a server open for use at
www.metanumber.com and compliant with the twelve
requirements listed. This standard is sufficiently defined to be
totally unambiguous, yet support sufficient subject domain
flexibility to accommodate other existing standards. Each
metanumber (a) value has (b) the units adjoined as
mathematical ‘variables’, (c) the representation of accuracy by
the presence of a decimal point, and (d) a unique metadata
path for every stored numerical value. This unique path name
for every numerical value not only defines that number
unambiguously, it also provides the link to unlimited
descriptive metadata tags. The permanent archiving of every
input and result for each user, with sequence number, date-time,
unit hash value, and optional subject, has the critical
feature that the exact mathematical structure, with all exactly
and uniquely named metanumber variables, provides with this log
the mathematical computational history and path for every
computed new number. The fact that the input expression is
stored with the result in this archival log, means all metadata
associated with each metanumber is traceable but not
encumbering to the system. Our current work indicates that
our new algorithm can effectively preprocess data from web
sites and reformat it in the metanumber standard (with
questionable units and indices flagged for the user). Our
parallel work can take an entity vs property table and
automatically generate two networks, one among the entities
as nodes and one among the properties as nodes. This then
allows the determination of clusters, named by the associated
eigenvalue and sorted by the magnitude of that eigenvalue as a
metric of the “tightness” of the cluster. Nodal names
associated with the highest weights in the eigenvector provide
descriptive metadata describing the identity and composition
of the cluster. The solid mathematical foundation of this
automated analysis of data structures uses the theory of Lie
algebras and groups, continuous and discrete Markov
transformations, network topology, and cluster analysis, which
can provide advanced analytics for still another stage of
analysis. That next stage is now set for an analysis of the topological
networks among users, units, tables, primary constants,
unlimited linked metadata, and models. The collaboration
model, under a subject name, supports either an individual or
groups of users working with secure data-model sharing with
documentation. Finally, because the system supports full
archival tracing with numerical accuracy, it follows that one
can study the path dependence of information loss (similar to
the path dependence of non-conservative forces in physics).
The ability to automate cluster identification with associated
metadata linkages is a key component of an intelligent system.
But most important, this system supports the sharing of
standardized data instantly among computers for research
groups, corporations, and governmental (and military) data
analysis with fully automatic dimensional analysis, error
analysis, and unlimited metadata tracking.
REFERENCES
[1] NIST Fundamental Physical Constants, http://physics.nist.gov/cuu/Constants/Table/allascii.txt
[2] Bureau of Economic Analysis, NIPA Data Section 7 CSV format, http://www.bea.gov/national/nipaweb/DownSS2.asp
[3] Statistical Abstract of the United States, Table 945, http://www.census.gov/compendia/statab/cats/energy_utilities/electricity.html
[4] NIST Data Standards, http://www.nist.gov/srd/nsrds.cfm
[5] USGS Data Standards, http://www.usgs.gov/datamanagement/plan/datastandards.php
[6] Johnson, Joseph E., 2006, Apparatus and Method for Handling Logical and Numerical Uncertainty Utilizing Novel Underlying Precepts, US Patent 6,996,552 B2.
[7] Johnson, Joseph E., and Ponci, F., 2008, Bittor Approach to the Representation and Propagation of Uncertainty in Measurement, AMUEM 2008 International Workshop on Advanced Methods for Uncertainty Estimation in Measurement, Sardagna, Trento, Italy.
[8] Lebigot, Eric O., 2014, A Python Package for Calculations with Uncertainties, http://pythonhosted.org/uncertainties/
[9] Johnson, Joseph E., 1985, US Registered Copyrights TXu 149-239, TXu 160-303, and TXu 180-520.
[10] Newell, David B., A More Fundamental International System of Units, Physics Today 67(7), 35 (July 2014), http://scitation.aip.org/content/aip/magazine/physicstoday/article/67/7/10.1063/PT.3.2448
[11] Estrada, Ernesto, 2011, The Structure of Complex Networks: Theory and Applications, Oxford University Press.
[12] Newman, M. E. J., The Structure and Function of Complex Networks, Department of Physics, University of Michigan; and Newman, M. E. J., 2010, Networks: An Introduction, Oxford University Press.
[13] Bailey, Ken, 1994, "Numerical Taxonomy and Cluster Analysis", in Typologies and Taxonomies, p. 34, ISBN 9780803952591; see also https://en.wikipedia.org/wiki/Cluster_analysis
[14] Theory of Lie Groups, https://www.math.stonybrook.edu/~kirillov/mat552/liegroups.pdf
[15] Johnson, Joseph E., 1985, Markov-Type Lie Groups in GL(n,R), Journal of Mathematical Physics 26(2), 252-257.
[16] Laplacian matrix, http://www.math.ucsd.edu/~fan/research/cb/ch1.pdf; https://en.wikipedia.org/wiki/Laplacian_matrix
[17] Johnson, Joseph E., 2005, Networks, Markov Lie Monoids, and Generalized Entropy, Third International Workshop on Mathematical Methods, Models, and Architectures for Computer Network Security, St. Petersburg, Russia, Springer Proceedings, 129-135, ISBN 3-540-29113-X.
[18] Johnson, Joseph E., 2012, Methods and Systems for Determining Entropy Metrics for Networks, US Patent 8,271,412.
[19] Johnson, Joseph E., 2009, Dimensional Analysis, Primary Constants, Numerical Uncertainty and MetaData Software, American Physical Society AAPT Meeting, USC, Columbia, SC.
[20] Johnson, Joseph E., and Campbell, William, 2014, A Mathematical Foundation for Networks with Cluster Identification, KDIR Conference, Rome, Italy.
[21] Johnson, Joseph E., 2014, A Numeric Data-Metadata Standard Joining Units, Numerical Uncertainty, and Full Metadata to Numerical Values, EOS KDIR Conference, Rome, Italy.