Numerical Data Standardization and
A Novel Cluster Analysis
Using Mathematical Networks
Joseph E. Johnson, PhD
Distinguished Professor Emeritus
Physics Department, University of South Carolina,
Columbia, SC, 29208
jjohnson@sc.edu
April 5, 2016
Numbers serve to discipline rhetoric.
Without them it is too easy to follow flights of fancy, to
ignore the world as it is and to remold it nearer the
heart's desire.
Ralph Waldo Emerson
The Problem:
• Numerical Data is incomplete without the
  • Value (the number itself)
  • Units (its dimensional basis)
  • Accuracy (its numerical uncertainty)
  • Defining Metadata (its exact meaning and “name”)
• But Published Data Cannot be Read by Computers Because:
• Units, Uncertainty, and Defining Metadata Indices are
• Non-standard, Scattered, Buried in Text, & in Multiple Locations
• Consequently:
• Data must be preprocessed by humans prior to every data transfer
• A costly, time consuming, error prone process
• Data exchange is tedious and AI is thwarted esp. for Big Data
But just imagine …:
• If every number came with its units, uncertainty, and exact meaning
all attached in a new type of object, a “metanumber”:
• a universal standard for all numerical data,
• Instantly readable by both humans and computers,
• And there was a simple unique name for every number,
• Which name could be used as a variable name in expressions,
• With all dimensional & error analysis automatically performed
• And all computations tracked allowing reuse of results
• Supporting AI, Big Data, and automated data exchange.
And Also Imagine :
• An algorithm continuously scraping all numerical data from the web
and converting it to standardized MetaNumber tables
• And then with a novel process, each table is converted to a network
• One among entities (rows) and one among properties (columns)
• From which the dominant clusters are extracted and ordered
• Using a novel agnostic cluster identification algorithm
• Then with still another algorithm, the dominant clusters are linked
into a single supernet spanning all numeric clusters.
• A single network spanning our entire numerical universe
Proposed Solution
• Design a standard for all numerical data that tightly links
• Value - Units – Accuracy – Meaning (metadata)
• We call this object a MetaNumber (MN)
• Design Algorithms to automatically evaluate MN expressions
  • Numerical results computed
  • Full dimensional analysis of units with error trapping
  • Automatic computation of numerical accuracy (uncertainty)
  • Tracking of all associated metadata
• MNs must be instantly readable by humans and computers
Other Requirements
• Use the SI metric system as the foundation
• Allow unlimited attached metadata without encumbrance
• Be sufficiently flexible to encompass existing standards
• Support computational collaboration
• Be compatible with current standards in computer coding
• Track the complete computational evolution of each number
• Allow extensions to the SI base units & user defined units
• Allow other user programs to call MN through an API.
• Results must be expressible in any desired units.
Numerical Accuracy:
• If there is no decimal then the value is treated as exact
• Change 5.43e4 to 543e2 to make the value exact.
• If there is a decimal then it is converted to a “ufloat”
• 5.43e4 becomes 5.43+/-0.01e4 automatically
• Accuracy is based upon the number of significant digits
• Unit conversion values and internal constants are kept as exact
• Keyed MNs and table values are treated as stated
• > 4+5
>> 9
• > 4.0 + 5
>> 9.00+/-0.10
• > 345e-2 * 88
>> 303.6
• > 3.45*88
>> 303.6+/-0.9
• If the uncertainty is not needed, ignore it and use the value alone
• But the value as listed represents the result.
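The accuracy rules above can be sketched in a few lines of Python. This is a minimal illustration, not the MetaNumber implementation: the function names are hypothetical, the uncertainty is assumed to be one unit in the last written digit, and uncertainties are assumed independent and combined to first order.

```python
import math
from decimal import Decimal


def parse_literal(text: str) -> tuple[float, float]:
    """Return (value, uncertainty) for a numeric literal.

    Assumed rule, following the slide: no decimal point means the value is
    exact; otherwise the uncertainty is one unit in the last written digit.
    """
    number = Decimal(text)
    if "." not in text:
        return float(number), 0.0
    last_digit_exponent = number.as_tuple().exponent
    return float(number), float(Decimal(1).scaleb(last_digit_exponent))


def add(a, b):
    """Sum of two (value, uncertainty) pairs; uncertainties add in quadrature."""
    return a[0] + b[0], math.hypot(a[1], b[1])


def multiply(a, b):
    """Product of two pairs using first-order error propagation."""
    return a[0] * b[0], math.hypot(b[0] * a[1], a[0] * b[1])


print(parse_literal("5.43e4"))   # decimal point present: uncertainty attached
print(parse_literal("543e2"))    # no decimal point: exact
print(add(parse_literal("4.0"), parse_literal("5")))
print(multiply(parse_literal("3.45"), parse_literal("88")))
```

Under these assumptions 5.43e4 carries an uncertainty of 0.01e4, while 543e2 carries none, which is the distinction the slide draws.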
Units - 1
• Are variable names in mathematical expressions:
• > 3*kg + 2.4*lb
>> 4.09+/-0.05*kg
• Are single names: ouncetroy not troy ounce.
• > 4*ouncetroy
>> 0.1244139072*kg
• Are lower case with only a…z, 0-9 characters, “_” (interior):
• mile, mach, c, m3, s_2, gallon
• kelvin or k not Kelvin; ampere, amp, or a not Ampere
• No information allowed in fonts, or superscripts or subscripts
• Not m² but m2; not s⁻² but s_2
Units – 2
• No Greek, other alphabetic or special symbols:
• ohm not Ω, hb not ħ, pi not π.
• Use singular spelling and not plural:
• foot not feet, mile not miles, gallon not gallons
• Are all defined in terms of the basic SI units:
• meter, kilogram, second, ampere, kelvin, candela
• Or abbreviated as m, kg, s, a, k, cd
Units - 3
• All prefixes are separate words, compounded any way:
• >micro*thousand*nano*dozen*ft
>> 3.6576e-12*m
• Allows metric, English, and common dimensionless prefixes
• All results are in SI (metric) units by default
• > 4*m * 8*m
>> 32.0*m2
• For results in other units use: (expression) ! (units desired):
• > 4*m * 8*m ! acre
>> 0.00790734057749*(acre)
• > c ! (mile/week)
>> 1.12663593737e+11*((mile/week))
• Dimensional errors are trapped: 3*kg + 4*m gives an error
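One common way to get this behavior is to carry a vector of base-unit exponents with every value. The sketch below assumes that design; the class Quantity, its method to (standing in for the "!" operator), and the helper _as_quantity are hypothetical names, and the actual MetaNumber code may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value in SI units plus exponents of (m, kg, s, a, k, cd)."""
    value: float
    dims: tuple = (0, 0, 0, 0, 0, 0)

    def __mul__(self, other):
        other = _as_quantity(other)
        return Quantity(self.value * other.value,
                        tuple(x + y for x, y in zip(self.dims, other.dims)))

    __rmul__ = __mul__  # lets plain numbers appear on the left: 3 * kg

    def __truediv__(self, other):
        other = _as_quantity(other)
        return Quantity(self.value / other.value,
                        tuple(x - y for x, y in zip(self.dims, other.dims)))

    def __add__(self, other):
        other = _as_quantity(other)
        if self.dims != other.dims:
            raise ValueError("dimensional error: cannot add unlike units")
        return Quantity(self.value + other.value, self.dims)

    def to(self, unit):
        """Express this quantity as a multiple of `unit` (like `expr ! unit`)."""
        if self.dims != unit.dims:
            raise ValueError("dimensional error: incompatible target units")
        return self.value / unit.value


def _as_quantity(x):
    return x if isinstance(x, Quantity) else Quantity(float(x))


# Base units, derived units, and prefixes are all ordinary variables.
m = Quantity(1.0, (1, 0, 0, 0, 0, 0))
kg = Quantity(1.0, (0, 1, 0, 0, 0, 0))
s = Quantity(1.0, (0, 0, 1, 0, 0, 0))
ft, mile, week = 0.3048 * m, 1609.344 * m, 604800 * s
c = 299792458 * m / s
micro, thousand, nano, dozen = 1e-6, 1e3, 1e-9, 12

area = (4 * m) * (8 * m)                    # dims become m2
tiny = micro * thousand * nano * dozen * ft  # prefixes compound freely
speed = c.to(mile / week)                    # c in miles per week

try:
    3 * kg + 4 * m
except ValueError as error:
    print(error)
```

Because prefixes are plain numbers and units are plain variables, compound prefixes and the "!" conversion need no special parsing in this sketch.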
Units – 4 - Past Results (history)
• Results are sequentially numbered for each user, 1,2,…
• Past results cannot be erased
• A user can use a past result as hist(18) or [my_18] as:
• > 2* [my_18]
• The history file for a user can be retrieved with: history number offset
• history 153 20 begins with result 20 and provides 153 results
• If the offset is omitted, then 10 is the default
• Past results of each user are archived in a table as:
• username, seq#, datetime, input string, result, uid, subject
• The optional subject is set with: subject = any name
Units 5 - Optional Base Units
• Six new units are optionally available:
  • bit, b: a binary bit of information: 1/0 or T/F
  • flop; op: a unit of one floating point operation
  • person, p: a unit of a living human
  • dollar, usd, d: a unit of the US dollar
  • bn: baryon number (conserved like charge)
  • ln: lepton number (conserved like charge)
• These allow a vast extension of the SI system in the economic and social sciences:
  • Population density: 45.2*p/hectare
  • Income: 34.7e3*d/(p*year)
  • Information flow rate: 4.2*giga*bit*s_1
  • Processing speed: 3.8*tera*flops
  • Where flops = flop/s
Metadata - 1
• Any MN can be multiplied by {var1=val1 | var2=val2|…}:
  • 6.4533e-4*kg*m_3*{lon=44*deg|lat=81*deg|depth=34.3*m}
  • The expression {…anything at all….} always evaluates to “1”
  • Thus {…} is a container for information attached to a MN
  • {.} or {.|reason = div by zero} to indicate a missing value
• An input expression containing a missing value results in a missing value
• {?} or {?|possible read error for x1} indicates a questionable value
• Computation proceeds normally but {?} is attached to the result
• Commands can be given in the form:
• subject = value to set a subject for a user
• subject = to remove a subject
• subject = ? to determine the current subject
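The {…} container rules above can be illustrated with a small parser. This is a sketch under assumptions: the function name split_metadata, the flag names, and the "notes" key for free-text fields are all hypothetical and not part of the MetaNumber standard.

```python
import re

BRACES = re.compile(r"\{([^{}]*)\}")


def split_metadata(expression: str):
    """Separate {...} containers from an expression string.

    Returns (bare_expression, metadata, flags). Each container is replaced
    by the literal 1, mirroring the rule that {...} always evaluates to 1.
    """
    metadata = {}
    flags = set()
    for body in BRACES.findall(expression):
        for field in body.split("|"):
            field = field.strip()
            if field == ".":
                flags.add("missing")          # {.} marks a missing value
            elif field == "?":
                flags.add("questionable")     # {?} marks a questionable value
            elif "=" in field:
                key, value = field.split("=", 1)
                metadata[key.strip()] = value.strip()
            elif field:
                # Free text without "=" is kept as a note (assumed handling).
                metadata.setdefault("notes", []).append(field)
    return BRACES.sub("1", expression), metadata, flags


bare, meta, flags = split_metadata(
    "6.4533e-4*kg*m_3*{lon=44*deg|lat=81*deg|depth=34.3*m}")
print(bare)
print(meta)
print(split_metadata("{?|possible read error for x1}"))
```

An evaluator built on this would compute the bare expression as usual, return a missing value when the "missing" flag is set, and re-attach {?} to the result when the "questionable" flag is set.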
Metadata - 2
• Metadata that applies to every metanumber in a row or a
column can be placed in that row or column such as a web
link.
• A row or column that holds such metadata is marked by a row
or column name that begins with a “%”. For example with the elements:
%%Name      %% Symbol     Density         Atomic Weight
%Web Link   www.nist.ce   www.acs.den     www.chem.com
Hydrogen    H             0.002*kg*m_3    1.00032*u
Helium      He            ….
MetaNumber Archived Values
• MetaNumbers are stored in tables on the default server:
• The table dimension can be 0, 1, 2, …
• The table, row, and column have unique names (indices) for
MN values
• A value is accessed in a two dimensional table as :
• [table name_ row name _ column name…]
• This gives a path to indicate the exact location of the
metanumber
• Thus each archived MN has a unique name
• For example: [e_gold_thermal conductivity] will be replaced by
the thermal conductivity of gold from the elements table (e)
• MetaNumbers stored at any location other than the default
server are indicated by:
• [internet path_directory path__table name_ row name _ column
name…]
• This allows the unique path name to uniquely name every MN
• So every number has a unique name
• This “name” [ …] can be used as a variable in any expression
• Thus [e_lead_density]/[e_gold_density] gives the density ratio
• 0.58822+/-0.00006
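The naming scheme can be sketched as a lookup that splits a bracketed name into table, row, and column. The code below is illustrative only: the in-memory TABLES dictionary and its density values are placeholders, not data from the MetaNumber server, and the split assumes that table and row names contain no underscore.

```python
import re

# Placeholder data for illustration only (densities in kg/m3);
# these are not values retrieved from the MetaNumber tables.
TABLES = {
    "e": {
        "lead": {"density": 11342.0},
        "gold": {"density": 19282.0},
    }
}

NAME = re.compile(r"\[([^\[\]]+)\]")


def lookup(name: str) -> float:
    """Resolve 'table_row_column' to the stored value."""
    table, row, column = (part.strip() for part in name.split("_", 2))
    return TABLES[table][row][column]


def substitute(expression: str) -> str:
    """Replace every [table_row_column] name with its numeric value."""
    return NAME.sub(lambda match: repr(lookup(match.group(1))), expression)


print(substitute("[e_lead_density]/[e_gold_density]"))
print(lookup("e_lead_density") / lookup("e_gold_density"))
```

Splitting on the first two underscores only lets a column name such as "thermal conductivity" pass through intact.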
Documentation
• An input of the form # AnyString is not processed as an expression
(# is in position 1)
• It is saved as a comment string as the result for that seq#
• This allows one to document one's work and explain the
process
• Thus one can compose a document that will need only minor
editing.
Sample Problems 1
• expression ! units desired gives the result in units desired
• >16*ft + 38*inch - 1.2*m ! ft
>> 15.23+/-0.33*(ft)
• Metric prefixes can be used in any order
• > 456* billion*kilo* dozen*giga* million
>> 5.472e+30
• A farmer has 382.6 acres and needs 1.8 additional
inches of rain. How many gallons is this?
• > 382.6*acre*1.8*inch ! gallon
>> (1.87+/-0.10)e+07*(gallon)
• Here one simply multiplies the area times the height to get the
volume. Units are automatically managed
Sample Problems 2
• What is the gravitational attraction between a 222.3 lb
man and a 85.1 kg golf cart 6 ft away?
• Use F = G * m1 * m2 / d² to get the force in newtons.
• Note that the constant G is given by “g”
• >g*222.3*lb*85.1*kg/((6*ft)**2)
• >> (1.7124+/-0.0022)e-07*m*kg*s_2
• A sports car accelerates from 0 to 60 miles per hour in 2.1 seconds.
How many “g’s” is this? (ag is the acceleration due to
gravity)
• > ((60*mile/hour)/(2.1*s)) ! ag
• >> 1.30+/-0.06*(ag)
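Both results on this slide can be checked by hand. The sketch below does so in plain Python under stated assumptions: each value written with a decimal point carries an uncertainty of one unit in its last digit, the conversion factors and constants are exact, independent relative uncertainties combine in quadrature, and the constant the slides call “g” is taken as the CODATA 2014 value of G.

```python
import math

# Conversion factors and constants, treated as exact.
LB = 0.45359237      # kilograms per pound
FT = 0.3048          # meters per foot
MILE = 1609.344      # meters per mile
HOUR = 3600.0        # seconds per hour
AG = 9.80665         # standard acceleration of gravity, m/s^2
G = 6.67408e-11      # assumed value of the slides' "g" (CODATA 2014), SI units

# Gravitational attraction: 222.3 lb man, 85.1 kg cart, 6 ft apart.
m1, dm1 = 222.3 * LB, 0.1 * LB   # one unit in the last digit, converted to kg
m2, dm2 = 85.1, 0.1
d = 6 * FT                        # "6" has no decimal point, so it is exact
force = G * m1 * m2 / d**2
dforce = force * math.hypot(dm1 / m1, dm2 / m2)

# Acceleration: 0 to 60 mile/hour in 2.1 s, expressed in units of ag.
t, dt = 2.1, 0.1
accel = (60 * MILE / HOUR) / t / AG
daccel = accel * dt / t

print(f"force = {force:.4e} +/- {dforce:.1e} m*kg*s_2")
print(f"accel = {accel:.2f} +/- {daccel:.2f} ag")
```

Under these assumptions the computed values agree with the slide's results to the digits shown there.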
Now Available at:
www.MetaNumber.com
• MetaNumber Web Site: Instructions, videos, …
• MetaNumber expression calculations
• APIs to link MetaNumber with your code
• MetaNumber Standardized Tables
• Automated network and cluster analysis of MN tables.
• Supernet (not yet available)
Our Appreciation to:
• Office of USC VP of Research
• for the ASPIRE I & II supporting grants
Faculty CoPIs – Phase I:
• Dr. Don Jordan, Mathematics
• Dr. William Hogue, USC CIO & VP
• Dr. Phil Moore, Director RCI
• Dr. Bob Mullen, Chair, Civil Engineering
• Dr. Paul Huray, Electrical Engineering
• Dr. Camelia Knapp, Geological Sciences and ESRI Director
• Dr. Tammi Richardson, Biology
• Dr. Dwayne Porter, Chair Public Health
• Dr. Geoff Scott, New Chair, Public Health
• Dr. Kendra Albright, Library and Information Science
• Dr. Susan Rathbun-Grubb, Library and Information Science
• Dr. Bert Ely, Biology & Director Center for Science Education
Faculty CoPIs - Phase II:
• Dr. William Hogue, USC VP and CIO
• Dr. Phil Moore, Director Research CyberInfrastructure (RCI)
• Dr. John Rose, Computer Science & Engineering
• Dr. Gabriel Terejanu, Computer Science & Engineering
• Dr. Francisco Blanco-Silva, Mathematics
• Dr. Kendra Albright, Library and Information Science
• Dr. Amir Karimi, Library and Information Science
Student Associates
• Phase I
  • Christian Merchant
  • Jordan Bradshaw
  • Brian Flick
  • Ben Torkian
  • Jun Zhou
  • William Hannah
  • Ethan Anderson
  • Ryan Giannelli
  • Dan Ramage
  • Lisa Wicklifte
  • Ideen Ghorbani
  • Samuel Watson
• Phase II
  • Sara Chizari
  • Roy Myers
  • Nick Puig
  • Dylan Shields
  • Sourav Das
  • Ideen Ghorbani
  • Jeevandeep Samanta
  • Chandra Raj Venkat
  • Chizhen Wu
  • Edward Pace
  • Ben Torkian
  • Sai Krishna
Thank you for your interest.
• www.metanumber.com
• www.asg.sc.edu/metanumber
• Visit these sites to get the latest in functions and operations
• Joseph E. Johnson, PhD
  • Distinguished Professor Emeritus
  • Department of Physics and Astronomy
  • Office: Room 405, Physical Science Center
  • Office: 803-777-6431
  • Cell: 803-920-1229
  • Email: jjohnson@sc.edu
  • University of South Carolina
  • Columbia, SC, 29208, USC
  • Web Site: www.asg.sc.edu & www.metanumber.com
Next you may wish to visit:
• MetaNumber Networks and Cluster Analysis
• These networks can be derived from the MetaNumber
archived tables of metanumbers.
• The subsequent cluster analysis can provide insight into
information that is hidden in a user's data.