A Proposed Numerical Data Standard With Automated Analytics Support

Joseph E. Johnson, PhD
Department of Physics, University of South Carolina, Columbia SC, 29208, USA
jjohnson@sc.edu

Abstract— (A) A standard is proposed for all numeric data that tightly integrates (1) each numerical value with (2) its units, (3) its accuracy (uncertainty) level, and (4) its defining metadata into a new object called a metanumber, with full mathematical processing among such metanumber objects. In essentially all current data tables, these four components are distributed, and often embedded in freeform text, in the title, references, row and column headings, and multiple non-standard locations within other text in the document. While usually readable by humans, this lack of structure frustrates computer reading of electronic representations of data, incurring increased costs, delays, and errors, with severe consequences for Big Data. We propose this metanumber standard for published numeric data in order to lay a foundation for fully automated reading (especially of Big Data) by intelligent agents, within a framework that satisfies twelve criteria. Our standard, meeting these requirements, has been designed, programmed, and is operational on a server in a Python environment as a multiuser cloud application, accessible from any internet-linked device using standardized metanumber tables. (B) Our current work is to automatically create such standardized metanumber data tables from existing web pages using a data normalization algorithm that acts on web sites to build the associated metanumber table, assisted by an evolving user interface when necessary. (C) Finally, as an example of how intelligent algorithms can extract information from such fully standardized numerical tables, we have applied our advances in the mathematical foundations and analysis of networks, including an associated agnostic cluster analysis. This framework utilizes our proof that every network, as defined by a connectivity (connection) matrix Cij, is isomorphic to the Lie algebra (monoid) generators of all continuous (as well as discrete) Markov transformations, thereby linking the extensive mathematical domains of Lie algebras and groups first to Markov transformations, then to network analysis, and finally to our associated cluster analysis. We describe the current results of our latest work, which builds two networks from entity-property standardized metanumber tables: one network among the entities (rows) and another among the numerical properties (columns) of those entities. Thus, as standardized metanumber tables are created, our algorithms automatically create these two associated networks and then compute and display the nodal clustering, first among entities and then separately among properties. We seek to fully automate this entire process, from web sites, to metanumber standardized tables, to the associated networks, to the resulting cluster analysis, with automated intelligent analysis for data sets, especially Big Data.

Keywords—Units, Metadata, Uncertainty, Standards, Clusters.

I. INTRODUCTION

If one examines electronic data from most sources [1,2,3], it becomes obvious that the units and the exact definitions of the values are hidden in text in multiple locations and in multiple formats.
While such data is normally easily understood by human intelligence, computers are unable to reliably infer the exact units for each value, and they have similar difficulty extracting the exact meaning of a value in an unstructured environment. Furthermore, assuming that the accuracy of a value is reflected in its number of significant digits, then when values are processed by a computer the results carry essentially as many digits as the computer retains in single or double precision. Thus the number of significant digits is lost unless it is captured immediately upon reading and then maintained correctly throughout the execution of all mathematical operations. Otherwise the accuracy of the result is obliterated with the first operation.

Big Data problems refer both to data sets of vast size, which require extensive processing speed, storage, and transmission rates, and, often more so, to the massive problems incurred by the astronomical number of existing tables, each formatted and structured in a different way. This diversity demands human preprocessing to convert each table to the user's framework for subsequent computation, thus frustrating automation. In conclusion, the examination of most published tabular data suggests that it is designed for space compression and easy reading by humans. As space is now rarely a problem, and as human preprocessing becomes unacceptable, we need to achieve simultaneous human and computer readability.

We call this standardized string object a "metanumber", as it is actually a string tightly linking all four information components. These components are to be managed through all mathematical operations to give new metanumbers containing the correct units, accuracy, and metadata tags. Yet achieving standardization requires far more than just establishing some structure. It requires (a) that one find intermediate workable solutions to convert existing data to the standard framework until such time as this standard becomes accepted for publishing data, and (b) that such a system be in keeping with current standards in computer science as well as in the domain disciplines [4,5]. We have developed an unambiguous standard that is nevertheless very flexible in accommodating existing standards in specialized subject domains. This standard satisfies the following twelve requirements. It is to:

(a) be easily readable by both humans and computers;
(b) be consistent with international scientific units on the SI metric foundation, with automatic management of dimensional analysis, including other units used in specific subject domains;
(c) provide automatic computation and management of numerical uncertainty;
(d) support a defining metadata standard for the unique representation of each number and its exact meaning;
(e) allow the linkage of unlimited metadata qualifiers and tags without the burden of an associated data transfer or processing;
(f) be sufficiently flexible to evolve and encompass existing domain standards that are currently in use;
(g) support computational collaboration and sharing among user groups;
(h) be compatible with current standards in computer coding;
(i) provide an algorithm to convert, as well as possible, current nonstandard representations to this standard, until such time as data is published in this standard;
(j) support the analysis of a number's complete historical computational evolution;
(k) support optional extensions to the SI base units; and
(l) be configured as a module or function that can be included in other systems.

In the second half of this paper, we also demonstrate ways in which intelligent agents can operate automatically upon such standardized data, utilizing our recent work on the mathematical foundations of networks and associated cluster identification algorithms. This application converts metanumber tables of entities and their properties into networks, thus supporting an automated procedure for cluster identification for each new standardized metanumber table. Our cluster identification algorithm provides an example of how this metanumber standard can support advanced intelligent analytics.

II. STANDARDS DEFINED

A. A Numerical Accuracy and Uncertainty Standard

While some numbers are exact, such as the number of people in a room or the value (by definition) of the speed of light, almost all measured values are limited in accuracy, as represented by the number of significant digits shown. In a very few cases one might know the standard deviation of the normal probability distribution, but such knowledge is not generally available. The correct management of numerical accuracy (uncertainty) is very complex because one is actually replacing a real number with a probability distribution, which is a function. This function needs multiple other real numbers to specify it unless one is confident that it is a normal distribution, and even then such functions do not close into other simple functions under all mathematical operations. While the author's research into this subject led to a new class of number [6,7] (similar to one based upon quantum qubits), that method is so complex that we do not wish to utilize it in this initial standard framework. Programming the correct retention of accuracy is also a complex task. Thus we were fortunate that an existing Python module (the Python uncertainties package [8]) has been developed that can be joined to our software. Because the number of significant digits is always available, we have chosen this module for the management of numerical accuracy: our Python functions convert a given real into a new class of number called a "ufloat", which retains both the numerical value (represented with correct accuracy) and the associated uncertainty arising from the limited number of digits. Thus it only remained for us to encode methods for when and how to convert a value to a ufloat or to keep it exact. Our standard is to represent any exact value with no decimal point present. This is achieved by adjusting the exponent in scientific notation to remove the decimal. If a value has a decimal point present, then it is treated as uncertain, having the number of significant digits shown. Thus 5.248e5 will be converted automatically to a ufloat, but that same value, if exact, would be entered as 5248e2 and treated as an exact real number, as there is no decimal. A second option utilizes the upper case "E" in scientific notation for exact numbers, with the lower case "e" for uncertain values. Our Python algorithm automatically converts all keyed and data table values to the correct form as just described, while all unit conversions and internal constants are treated as exact values. A minimal sketch of this rule follows.
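The following sketch is an illustration of ours, not the production metanumber code. It maps the decimal-point rule onto the Python uncertainties package; the helper name to_metanumber_value is hypothetical, and the uncertainty of a ufloat is taken here as one unit in the last significant digit.

    from uncertainties import ufloat

    def to_metanumber_value(text):
        # Exact if no decimal point (or upper-case E); otherwise a ufloat
        # whose uncertainty is taken here as one unit in the last digit.
        s = text.strip()
        mantissa = s.split('e')[0].split('E')[0]
        if '.' not in mantissa or 'E' in s:
            return float(s)                           # exact by convention
        decimals = len(mantissa.split('.')[1])        # digits after the point
        exponent = int(s.split('e')[1]) if 'e' in s else 0
        return ufloat(float(s), 10.0 ** (exponent - decimals))

    print(to_metanumber_value('5.248e5'))             # uncertain: 524800 +/- 100
    print(to_metanumber_value('5248e2'))              # exact: 524800.0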
A virtue of this system is that no other explicit notation is needed to encode numerical accuracy, as the conversion is executed automatically based upon the presence or absence of a decimal. Furthermore, the Python uncertainties module also supports other forms of uncertainty, which can be invoked in future releases of our metanumber software.

B. Units & Dimensional Analysis Standard

Our standard format for attaching units to a numerical value is to represent all units by variable names [9], attached so that the expression is a valid mathematical expression, such as the velocity 4.69e2*m/s. When data is expressed in tabular form, one has a choice of compressing the notation by writing the units for an entire table, row, or column at the top, and then requiring the multiplication that attaches the units to the value to be executed at the time of use. While this is the more common convention and requires less storage space, it also requires more processing time when each value is retrieved. As space is now rarely a problem, we have chosen to adjoin the units directly to each value, thus increasing storage, reducing processing time, and improving human readability. Naturally, when units are homogeneous over an entire table, row, or column, an alternative notation allows the units to be adjoined at the time of retrieval, so that a standardized metanumber table in this format requires no more space than otherwise but increases the processing. That method, offered as an option, labels a row or column as "*units", which causes the units that follow in that row or column (or even the table) to be adjoined to the value at the time of retrieval.

The general rules for unit names and their use are as follows (a minimal evaluator sketch follows the list).

(1) Units are variable names that form parts of valid mathematical expressions, e.g. 3*kg+2.4*lb; the conversions are executed automatically by the metanumber software.
(2) Units are single names, such as "ouncetroy" not "troy ounce", since each unit is a single variable name.
(3) Units are lower case with only a…z, 0-9, and the underscore "_" character: e.g. mach, c, m3, s_2, gallon, ampere, kelvin. This is in keeping with modern software conventions, in which variables are lower case and case sensitive; the lower-case convention also removes potential case ambiguity.
(4) Units never have superscripts, subscripts, or any information contained in font changes: m3, never a superscripted form. Such fonts are sometimes treated inconsistently in different software languages, and thus no information may be represented by a change in font.
(5) Units never use Greek or other special characters or symbols: ohm not Ω, hb not ħ, pi not π.
(6) Units are defined with nested sequential definitions in terms of the basic SI units m, kg, s, a, k, cd; the Python code defines each unit in terms of previous units back to the basic SI standard.
(7) Units use singular spelling and not plural: foot not feet, mile not miles. This is to avoid spelling mistakes and to reduce the code size.
(8) All dimensionless prefixes are included as separate words, to be compounded in any order as needed: micro* thousand* nano* dozen* ft. Note that the "*" or "/" must be present between each unit or prefix.
(9) Evaluation of expressions always results in SI (metric) units as the default. To obtain the result in any other units, one uses the form (expression) ! (units desired), e.g. c ! (mile/week).
(10) Dimensional errors are trapped: 3*kg + 4*m gives an error.
(11) All unit conversion values and internal constants are treated as exact numerical values without uncertainty.
(12) A user can reuse a past result with the form [my_423], where 423 is the sequence number of a previous result.
(13) Four new units are optionally available: bit, or b, for a bit of information (T/F, 1/0); person, or p, for a living human; dollar, usd, or d, for the US dollar; and flop, for a single floating point operation. These units vastly extend dimensional analysis and the clarity of expressions by allowing both information-based concepts, such as flops = flop/s and baud = bit/s, and socioeconomic units, such as p/acre (population density) and d/p (income as dollars per person).
(14) Users can define units in a jointly shared table with [_unitname], which can be used like a unit or variable and is created by the command {! unitname = definition}.
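The following minimal sketch, ours rather than the metanumber implementation, illustrates rules (1), (6), and (10): units as variables carrying SI dimension exponents, so that mixed-unit expressions evaluate to SI while dimensional errors are trapped. The class name Q and the small set of units shown are illustrative assumptions.

    class Q:
        # A value with an SI magnitude and dimension exponents (m, kg, s, a, k, cd).
        def __init__(self, mag, dims):
            self.mag, self.dims = mag, dims
        def _q(self, o):
            return o if isinstance(o, Q) else Q(o, (0,) * 6)
        def __mul__(self, o):
            o = self._q(o)
            return Q(self.mag * o.mag, tuple(a + b for a, b in zip(self.dims, o.dims)))
        __rmul__ = __mul__
        def __truediv__(self, o):
            o = self._q(o)
            return Q(self.mag / o.mag, tuple(a - b for a, b in zip(self.dims, o.dims)))
        def __pow__(self, n):
            return Q(self.mag ** n, tuple(a * n for a in self.dims))
        def __add__(self, o):                      # rule (10): trap dimensional errors
            o = self._q(o)
            if self.dims != o.dims:
                raise ValueError('dimensional error')
            return Q(self.mag + o.mag, self.dims)
        def __repr__(self):
            return f'{self.mag} (SI dims {self.dims})'

    m    = Q(1.0, (1, 0, 0, 0, 0, 0))
    kg   = Q(1.0, (0, 1, 0, 0, 0, 0))
    lb   = Q(0.45359237, (0, 1, 0, 0, 0, 0))       # defined from kg, rule (6)
    inch = Q(0.0254, (1, 0, 0, 0, 0, 0))           # defined from m
    g    = Q(6.674e-11, (3, -1, -2, 0, 0, 0))      # gravitational constant

    print(3 * kg + 2.4 * lb)                       # ~4.09 kg in SI, rule (1)
    print(g * 15 * lb * 12.33 * kg / (18.8 * inch) ** 2)   # ~2.5e-8 newtons
    # 3*kg + 4*m raises 'dimensional error', rule (10)

The last evaluated expression is exactly the gravitational-force calculation discussed below.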
A user can find the list of internal units on the web site (www.metanumber.com) and should become familiar with the spelling of the available internal units, especially those with short names that are abbreviations. With the metanumber system one can easily calculate the gravitational force between two masses even when the values are expressed in diverse units, such as between 15 lb and 12.33 kg separated by 18.8 inches: g*15*lb*12.33*kg/(18.8*inch)**2, using F = G*m1*m2/r12**2.

C. The Metadata Framework Standard

Standardized metanumbers are to be electronically published in tables of different dimensions: zero dimensions for a single value, one dimension for a list of values such as masses, and two dimensions for a two-dimensional table or array, often representing entities in rows with the properties of those entities in columns. For one-dimensional data the value can be retrieved with the form [mass_proton], where "mass" is the name of the table on the default directory of the default server and "proton" is the unique name of the row given in column one. For a two-dimensional table one might write [e_iron_thermal conductivity], which retrieves the thermal conductivity of iron in the elements table, abbreviated as "e". Thus the general multidimensional form for retrieving standardized metanumbers is [file name _ row name _ column name]. For comparison purposes, these names (like database indices) must be unique; the program removes all white space and lowers the case of each index string prior to comparison. If the table is not in the default directory and server, then the metanumber is retrieved as [internet path to server_directory path __ file name _ row name _ column name]. It is of great importance to realize that, with this standard, this expression provides a unique name for every standardized numerical value (metanumber) within a table with unique row and column string indices. These expressions can be used as variables in any mathematical expression, such as 5.4 times the ratio of copper density to zinc density: 5.4*[e_copper_density]/[e_zinc_density]. For books, the file name might be the ISBN number followed by the path to that table or value in the electronic document. This design allows an automated conversion of all metanumber tables into a relational database structure or SAS data sets. Our standard, however, is a simple flat file in comma separated values (CSV) form.
For a given table, there is metadata that applies to the whole table, with values for variables such as: table abbreviation, full name, data source(s), name of table creator, email of table creator, date created, security codes (optional), units (for all entries), and remarks. These variable names occupy the first row beginning in column two, with <MN> occupying the cell at row one, column one, to identify the table as a valid standardized metanumber table. The values for these variables are placed in the corresponding columns of row two. This provides flexibility for additional variables. The format of the table is comma separated values (CSV), which is easily created or viewed in spreadsheet software; thus commas are not allowed within entries. A subsequent row (3, 4, …) that has a name in column one preceded by a "%", such as %remarks, indicates that the row contains metadata applicable to the table as a whole, and such rows can contain anything. The first row that begins with the symbol %% in column one indicates that this row and the subsequent values in column one are the primary unique indices that can be used to point to values. Metanumbers are those cells in the corresponding matrix whose row and column index names do not begin with a "%" symbol, as such names indicate that the corresponding row or column contains only metadata to be associated with that row or column. Row or column names beginning with %% can be auxiliary unique index names for the corresponding row or column, such as an element symbol (a column labeled %%symbol) or its atomic number (%%atomic number) rather than its full name. Supporting metadata can be located in rows or columns that give web addresses for more extensive information, equations for use, video lectures, or other information or metadata. Thus in the elements table one could have a row of internet links under indices like density and thermal conductivity that give additional information, equations, or properties.

As a consequence of this design, the simple path to a metanumber, [table_row_column…], provides a universally unique name. This form is used to retrieve the metanumber and use it in expressions as a variable. The path name also denotes the path to unlimited metadata associated with the indicated value, without the transfer of that metadata. This design allows unlimited metadata tags to be linked to pharmaceuticals, accounting expenditures, medical procedures, scientific references, equipment or transportation items, and other conditions or environments such as reference temperatures.

While the above framework supports metadata that attaches to an entire table, row, or column of values, one must often attach metadata to specific metanumbers, such as longitude and latitude. This can be accomplished with the form *{var1=val1 | var2=val2 | …}, which can be multiplied at the end of any metanumber to provide information specific to that value, such as *{lon=…|lat=…|time=…}. Of special note are the missing value, {.} or {.|reason = div by zero}, and the questionable value, {?} or {?|reason = …}. Any missing value in an expression gives a missing result when evaluated. Any questionable value computes normally, but {?} is attached to the result. When the expression "{anything}" is encountered in a mathematical expression, it is converted to '1'; thus 3.45*{anything} results in 3.45. The form {anything} is therefore a container for attaching information within the mathematics without altering the results. A sketch of resolving a retrieval such as [e_iron_thermal conductivity] against such a CSV table follows.
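The following sketch, assuming a hypothetical file e.csv laid out as described above, shows how a [table_row_column] reference can be resolved; the helper name metanumber_lookup is an illustration of ours.

    import csv

    def metanumber_lookup(table_file, row_name, col_name):
        norm = lambda name: ''.join(name.split()).lower()   # strip space, lower case
        with open(table_file, newline='') as f:
            rows = list(csv.reader(f))
        if rows[0][0].strip() != '<MN>':
            raise ValueError('not a valid metanumber table')
        # The first row whose column one begins with %% carries the column
        # indices; the column-one values of later rows are the row indices.
        i = next(k for k, r in enumerate(rows) if r[0].startswith('%%'))
        j = next(k for k, name in enumerate(rows[i]) if norm(name) == norm(col_name))
        row = next(r for r in rows[i + 1:] if norm(r[0]) == norm(row_name))
        return row[j]

    # e.g. the value behind [e_iron_thermal conductivity]:
    # metanumber_lookup('e.csv', 'iron', 'thermal conductivity')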
Our metadata and metanumber standards are compliant with NIST [4] and USGS [5] recommendations, as well as with the latest new basis for the SI (metric) standards, which for the first time bases all fundamental units upon fundamental constants [10].

D. Other Components of the MetaNumber Standard

There is a single table that contains all actions of all users, with the following variables: userid_seq#, date time, input string, evaluated result, unit id# of result, and optional subject. The userid_seq# is a unique index for each line in the archive. A user can access their own past archived file of inputs and evaluated results by a seq# value or by a given active subject. Further details can be found by entering "help", which gives the procedures for the latest software release, or in the web user guides.

It is often important to document one's work. This can be accomplished by entering the # symbol as the first symbol on an input line; anything else can follow this symbol. This is like the comment line in Python code, and one can insert documentation text for later use that describes one's calculations. When the system sees the "#" symbol in the first position, all processing is bypassed and the subsequent string is stored as the result in the form {# text information…}. A series of calculations can be collected under a 'subject' name: upon entering the line {!subject = text}, the "subject" variable for that user is set to "text". The subject name is retained until {!subject = null} or {!subject = } is entered, or until the user logs out. This enables one to identify a set of calculations and remarks under a subject heading for later output as an individual user. {!subject = ?} will output the current subject. A user can also set a subject and identify other users who are allowed to see and retrieve the entries of other team members while they have the same subject set. The instructions for this are under "help".

Another feature of the metanumber system is a metanumber value which is itself a mathematical expression. For example, the speed of sound in air depends upon the operating temperature "To" in kelvin, which can be expressed as (331.5+(0.600)*(To-273.15*k))*m*s_1. The use of such expressions, when applicable, greatly enhances and compresses the required storage. A simplified evaluation of such a stored expression is sketched below.
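As an illustration of ours (not the production evaluator), the stored expression can be evaluated by substituting the operating temperature; for simplicity the unit symbols k, m, and s_1 are taken here as SI magnitudes of 1, so the result is the numeric speed in m/s.

    stored = "(331.5 + (0.600)*(To - 273.15*k))*m*s_1"
    units = {'k': 1.0, 'm': 1.0, 's_1': 1.0}    # unit magnitudes taken as 1 (SI)

    def evaluate(expr, **variables):
        # Evaluate a stored metanumber expression with no builtins exposed.
        return eval(expr, {'__builtins__': {}}, {**units, **variables})

    print(evaluate(stored, To=293.15))          # 343.5, i.e. m/s at 20 C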
III. UTILITY OF THE METANUMBER STANDARD FOR AUTOMATED AGENTS

A. Networks and Clustering

As an example of how automated intelligent agents can operate unassisted in the analysis of numerical information, we explore how we are developing agents that convert metanumber tables to networks [11] and then find clusters within those networks. It is well known that cluster analysis is one of the foundations of intelligent reasoning. Our very language is built upon names for entities that cluster in use, appearance, or function. Clustering is also the basis for the classification of biological organisms and for the names of objects in our everyday life. Yet there is no generally agreed upon definition of clustering on a mathematical or algorithmic basis, and there are over a hundred current algorithms for identifying clusters.

The following gives a very brief overview of our research into the mathematical foundations of networks and an associated agnostic algorithm for cluster identification, based upon an underlying Markov model with a probability-based proximity metric. We utilize our previous research on converting two-dimensional standardized metanumber tables of entities (rows) and their properties (columns) into one network for the collection of entities and another for the properties, and we provide the algorithm for finding clusters in both networks. This algorithm has been built to run automatically on all metanumber tables generated by the standardization algorithm. We first give a brief overview of our past research.

Networks represent one of the most powerful means of representing information as interrelationships (topologies) among abstract objects called nodes (points), which are identified by sequential integers 1, 2, …, n. A network is defined by a square matrix, normally very large and sparse, of non-negative real numbers with the diagonal missing, where the strength of connection between node i and node j is Cij. The diagonal is missing because there is no meaning to the connection of a thing to itself. The off-diagonal elements must be non-negative because there is no meaning to a connection that is less than no connection at all. The mathematical classification and analysis of the topologies represented by such objects is one of the most challenging and unsolved of all mathematical domains [12]. Cluster analysis [13] on these networks can uncover the nature of cohesive domains in networks where nodes are more tightly connected. There is no single agreed upon formal definition of a cluster.

B. Lie Algebras & Groups and Markov Monoids

This section gives a very brief overview of our mathematical results that provide a new foundation for network analysis and cluster identification. These results are built upon previous work in which the author developed a new method of decomposing the continuous general linear (Lie) group [14] of n x n transformations. That decomposition [15] showed the general linear Lie group to be the direct sum of two Lie groups: (1) a Markov-type Lie group (with n^2 - n parameters) and (2) an Abelian scaling group (with n parameters). Each group is generated, as is standard, by exponentiation of the associated (Markov or scaling) Lie algebra. The Markov-type generating Lie algebra consists of linear combinations of the basis matrices that have a "1" in each of the n^2 - n off-diagonal positions with a "-1" in the corresponding diagonal position for that column (an example of a Lagrange matrix [16]). When exponentiated, the resulting matrix M(a) = exp(aL) conserves the sum of the elements of a vector upon which it acts. By comparison, the Lie algebra and group that preserve the sum of the squares of a vector's components form the familiar rotation group in n dimensions. In this form, however, the Markov-type transformation can transform a vector with positive components into one with some negative components (which is not allowed for a true Markov matrix). However, if one restricts the linear combinations to only non-negative values, then, as we proved, one obtains all discrete and continuous Markov transformations of that size. This links all of Markov matrix theory to the theory of continuous Lie groups and provides a foundation for both discrete and continuous Markov transformations. There are a number of other details not covered here. A small numerical check of this construction follows.
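The following numpy/scipy sketch of ours verifies the stated properties on a small random example: a non-negative connection matrix, with its diagonal set to the negative of each column sum, generates a Markov matrix under exponentiation.

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    n = 5
    C = rng.random((n, n))
    np.fill_diagonal(C, 0)                      # non-negative, diagonal absent
    L = C - np.diag(C.sum(axis=0))              # set diagonal to -(column sum)
    M = expm(0.7 * L)                           # M(a) = exp(aL) with a = 0.7

    print(np.allclose(M.sum(axis=0), 1.0))      # True: each column sums to 1
    print((M >= 0).all())                       # True: a genuine Markov matrix
    x = rng.random(n)
    print(np.isclose((M @ x).sum(), x.sum()))   # True: element sum is conserved
    H2 = -np.log((M ** 2).sum(axis=0))          # second-order Renyi entropies
    print(np.argsort(H2))                       # nodal ordering by entropy

The last two lines anticipate the entropy ordering discussed in the next subsection.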
In particular, the number of terms in the expansion of the exponential of the Markov Lie algebra matrix gives the number of degrees of separation.

C. Networks, Markov Monoids, and Cluster Identification

Our next discovery [17] was that every possible network (Cij) corresponds to exactly one element of the Markov generating Lie algebra (the elements with non-negative linear combinations) and, conversely, every such Markov Lie algebra generator corresponds to exactly one network. Thus the two are isomorphic, and one can now study all networks by studying the associated Markov transformations, the generating Lie algebra, and the associated groups [18,19]. Our subsequent discoveries [20,21] are that all nodes in any network can be ordered by the second-order Renyi entropies of the associated columns of that Markov matrix (a second entropy spectral curve can be generated from the rows), thus representing the network by a pair of Renyi entropy spectral curves and removing the combinatorial problems that frustrate many areas of network analytics. One can now both compare two networks (by comparing their entropy spectral curves) and study the change of a network's topology over time. One can even define the "distance" between two networks as the exponentiated negative of the sum of squares of the differences of their Renyi entropies (if the numbers of nodes differ, one uses a spline smoothing). We recently showed that the entire topology of any network can be exactly represented by the sequence of the necessary number of Renyi entropy order differences, which converge very rapidly. This is similar to the Fourier expansion of a function, such as a sound wave, where each successive order represents successively less important information.

Our next and equally important result was that an agnostic (assumption-free) identification of the n network clusters is given by the eigenvectors of this Markov matrix. This not only shows the degree of clustering of the nodes but actually ranks the clusters by the corresponding eigenvalues, thus giving a name (the eigenvalue) to each cluster. A more descriptive name can be created from the nodal names (indices) associated with the highest-valued components of the associated eigenvector. The reasoning underlying the cluster identification is that the Markov matrix generated by the altered connection matrix (Lie algebra element) has eigenvalues that successively represent the rate of approach to equilibrium for a conserved substance flowing among the nodes in proportion to their degree of connectivity. Thus network clusters produce a higher rate of approach to equilibrium, as measured by the associated eigenvalue, for flows among the nodes identified by the eigenvector.

D. Metanumber Tables Define Networks and Clusters

Now let us recall the two-dimensional standardized metanumber storage of entities and their properties. For simplicity, consider a table of metanumber values with a single unique index label for the row and another for the column. One example we have explored is the table of the chemical elements (rows) versus the fifty or so properties of those elements; another is the nutrition table of 8,400 foods versus 56 numerical properties of their nutritional contents.
Another such table would be personal medical records of persons versus their approximately one hundred numerical metrics of blood, urine, and other numerical parameters, with multiple records for each person identified by time. Another would be the table of properties of pharmaceutical substances, including all inorganic and organic compounds, or even a corporation's manufactured items.

Our most recent development is that one can generate two different networks from such a table of values Tij for entities (such as the chemical elements in rows) with properties (such as density or boiling point) for each element in a column, where each column has identical units. To do this we first normalize the metanumber table by finding the mean and standard deviation of each column and then rewriting each value as the number of standard deviations away from the mean, which is set to zero. Since each property column has the same units (for example, every value in the density or thermal conductivity column has the same dimensionality), this process also removes the units, and the new values are dimensionless. We then define a network Cij among the n entities (here the elements listed in rows) as the exponentiation of the negative of the sum of the squares of the differences between Tik and Tjk, thus Cij = exp(-Σk (Tik - Tjk)^2). This gives a maximum connectivity of 1 if the values are all the same and a connectivity near 0 if they are far apart, as would be expected for the definition of a network. This is similar to the expectation that a measured value differs from the true value by that degree. We can form a similar network among the properties of that table. One then, as before, adjusts the diagonal of C to be the negative of the sum of all terms in that column, forms the Markov matrix M(a) = exp(aC), and finds the eigenvectors and eigenvalues of M to reveal the associated clustering. The rationale for why this works can be understood when the Markov matrix is viewed as representing the dynamical flow of a conserved hypothetical substance among the nodes: its action on a vector of non-negative values leaves the vector's sum invariant. This methodology includes and generalizes the known methodology based upon a Lagrange matrix.

We are currently exploring the clusters and network analysis that can be generated from tables of standardized metanumber values. We have done this for the elements table, which was very revealing as it displayed many of the known similarities among elements: the cluster of iron, cobalt, and nickel was very clear, as were other standard clusters of elements such as the halogens and actinides. We are currently analyzing a table on the properties of pesticides, and we are also performing this analysis on a table of 56 nutrients for 8,600 natural and processed foods. The study of clustering in foods based upon their nutrient and chemical properties, as well as the clustering among the properties themselves, will be reported as available on the www.metanumber.com web site. Our initial results show three of the dominant food-nutrition clusters to be cooking oils, nuts, and baby foods. Results of this analysis are expected by the end of September 2015. Numerical standardization in terms of metanumber tables leads to other, more holistic networks that can be built utilizing the metanumber structures, with resultant cluster analysis. The complete pipeline from a normalized table to ranked clusters is sketched below.
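The pipeline just described can be condensed into the following numpy/scipy sketch of ours (not the production code); the 20 x 6 random table is a stand-in for a real entity-property metanumber table.

    import numpy as np
    from scipy.linalg import expm

    def clusters_from_table(T, a=1.0):
        # Rewrite each column as standard deviations from its mean (dimensionless).
        Z = (T - T.mean(axis=0)) / T.std(axis=0)
        # Entity network: C_ij = exp(-sum_k (Z_ik - Z_jk)^2), diagonal removed.
        D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        C = np.exp(-D2)
        np.fill_diagonal(C, 0)
        L = C - np.diag(C.sum(axis=0))            # Markov generator from network
        vals, vecs = np.linalg.eigh(expm(a * L))  # L is symmetric here, so eigh is safe
        order = np.argsort(-vals)                 # rank clusters by eigenvalue
        return vals[order], vecs[:, order]

    T = np.random.default_rng(1).random((20, 6))  # stand-in entity/property table
    vals, vecs = clusters_from_table(T)           # entity clusters
    pvals, pvecs = clusters_from_table(T.T)       # property clusters, same machinery
    top_nodes = np.argsort(-np.abs(vecs[:, 1]))[:3]
    print(vals[:3], top_nodes)                    # leading eigenvalues; nodes dominating a cluster

The largest eigenvalue corresponds to equilibrium; the eigenvectors of the next-ranked eigenvalues identify the clusters, with the highest-weighted nodes supplying descriptive names as discussed above.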
Users (identified by PIN#) can each be linked to (a) the unit id (UID) hash values of the results of their calculations, as well as to (b) the table, row, and column names of each value. These linkages can be supplemented with linkages of each user to the universal constants that occur in the expressions they evaluate. The resulting network links users to (a) concepts such as thermal conductivity, (b) substances such as silver alloys, and (c) core constants such as the Boltzmann constant, Planck's constant, or the neutron mass. The expansion of this network in different powers, giving the different degrees of separation, can then link users via their computational profiles (the user i x user j component of C), as well as produce linkages among substances, metadata tags, and constants. The clustering revealed at different levels of such expansions can then reveal groups of users connected by common computational concepts. Users working on particular domains of pharmaceuticals and methodologies are thus identified as clusters, as are groups of astrophysicists utilizing certain data and models. At the same time, the clustering can identify links among specific substances, models, and methodologies. Our current research is exploring such networks and clusters as the underlying metanumber usage expands.

IV. CONCLUSIONS

A standard for the integration of numerical values with their units, accuracy level, and exact unique path name (termed a "metanumber") has been proposed and programmed in Python on a server open for use at www.metanumber.com, compliant with the twelve requirements listed. This standard is sufficiently defined to be totally unambiguous, yet supports sufficient subject-domain flexibility to accommodate other existing standards. Each metanumber (a) value has (b) its units adjoined as mathematical 'variables', (c) its accuracy represented by the presence or absence of a decimal point, and (d) a unique metadata path for every stored numerical value. This unique path name for every numerical value not only defines that number unambiguously, it also provides the link to unlimited descriptive metadata tags. The permanent archiving of every input and result for each user, with sequence number, datetime, unit hash value, and optional subject, has the critical feature that the exact mathematical structure, with all metanumber variables exactly and uniquely named, provides with this log the mathematical computational history and path for every computed new number. The fact that the input expression is stored with the result in this archival log means that all metadata associated with each metanumber is traceable without encumbering the system.

Our current work indicates that our new algorithm can effectively preprocess data from web sites and reformat it in the metanumber standard (with questionable units and indices flagged for the user). Our parallel work can take an entity-versus-property table and automatically generate two networks, one among the entities as nodes and one among the properties as nodes. This then allows the determination of clusters, each named by the associated eigenvalue and sorted by the magnitude of that eigenvalue as a metric of the "tightness" of the cluster. Nodal names associated with the highest weights in the eigenvector provide descriptive metadata describing the identity and composition of the cluster.
This automated analysis of data structures rests on a solid mathematical foundation of Lie algebras and groups, continuous and discrete Markov transformations, network topology, and cluster analysis, and can provide advanced analytics for still another stage of analysis. That next stage is now set for an analysis of the topological networks among users, units, tables, primary constants, unlimited linked metadata, and models. The collaboration model, under a subject name, supports either an individual or groups of users working with secure data-model sharing and documentation. Finally, because the system supports full archival tracing with numerical accuracy, one can study the path dependence of information loss (similar to the path dependence of non-conservative forces in physics). The ability to automate cluster identification with associated metadata linkages is a key component of an intelligent system. But most important, this system supports the instant sharing of standardized data among computers for research groups, corporations, and governmental (and military) data analysis, with fully automatic dimensional analysis, error analysis, and unlimited metadata tracking.

REFERENCES

[1] NIST Fundamental List of Physical Constants, http://physics.nist.gov/cuu/Constants/Table/allascii.txt
[2] Bureau of Economic Analysis, NIPA Data Section 7, CSV format, http://www.bea.gov/national/nipaweb/DownSS2.asp
[3] Statistical Abstract of the United States, Table 945, http://www.census.gov/compendia/statab/cats/energy_utilities/electricity.html
[4] NIST Data Standards, http://www.nist.gov/srd/nsrds.cfm
[5] USGS Data Standards, http://www.usgs.gov/datamanagement/plan/datastandards.php
[6] Johnson, Joseph E., 2006, Apparatus and Method for Handling Logical and Numerical Uncertainty Utilizing Novel Underlying Precepts, US Patent 6,996,552 B2.
[7] Johnson, Joseph E., and Ponci, F., 2008, Bittor Approach to the Representation and Propagation of Uncertainty in Measurement, AMUEM 2008 International Workshop on Advanced Methods for Uncertainty Estimation in Measurement, Sardagna, Trento, Italy.
[8] Lebigot, Eric O., 2014, A Python Package for Calculations with Uncertainties, http://pythonhosted.org/uncertainties/
[9] Johnson, Joseph E., 1985, US Registered Copyrights TXu 149-239, TXu 160-303, and TXu 180-520.
[10] Newell, David B., A More Fundamental International System of Units, Physics Today 67(7), 35 (July 2014), http://scitation.aip.org/content/aip/magazine/physicstoday/article/67/7/10.1063/PT.3.2448
[11] Estrada, Ernesto, The Structure of Complex Networks: Theory and Applications, ISBN 978-0-19-959175 (2011).
[12] Newman, M. E. J., "The Structure and Function of Complex Networks", Department of Physics, University of Michigan; and Newman, M. E. J., Networks: An Introduction, Oxford University Press, 2010.
[13] Bailey, Ken, 1994, "Numerical Taxonomy and Cluster Analysis", in Typologies and Taxonomies, p. 34, ISBN 9780803952591; see also https://en.wikipedia.org/wiki/Cluster_analysis
[14] Theory of Lie Groups, https://www.math.stonybrook.edu/~kirillov/mat552/liegroups.pdf
[15] Johnson, Joseph E., 1985, Markov-Type Lie Groups in GL(n,R), Journal of Mathematical Physics 26(2), 252-257.
[16] Laplacian matrix, http://www.math.ucsd.edu/~fan/research/cb/ch1.pdf and https://en.wikipedia.org/wiki/Laplacian_matrix
[17] Johnson, Joseph E., 2005, Networks, Markov Lie Monoids, and Generalized Entropy, in Computer Network Security: Third International Workshop on Mathematical Methods, Models, and Architectures for Computer Network Security, St. Petersburg, Russia, Springer Proceedings, 129-135, ISBN 3-540-29113-X.
[18] Johnson, Joseph E., 2012, Methods and Systems for Determining Entropy Metrics for Networks, US Patent 8,271,412.
[19] Johnson, Joseph E., 2009, Dimensional Analysis, Primary Constants, Numerical Uncertainty and MetaData Software, American Physical Society AAPT Meeting, USC, Columbia, SC.
[20] Johnson, Joseph E., and Campbell, William, 2014, A Mathematical Foundation for Networks with Cluster Identification, KDIR Conference, Rome, Italy.
[21] Johnson, Joseph E., 2014, A Numeric Data-Metadata Standard Joining Units, Numerical Uncertainty, and Full Metadata to Numerical Values, EOS KDIR Conference, Rome, Italy.