SPECIFICATIONS OF THE PEDOTRANSFER RULES PROJECT.
______________________________________________________________________________
Joel DAROUSSIN
Unite de Science du Sol
Institut National de la Recherche Agronomique
INRA - Avenue de la Pomme de Pin
BP 20619 - Ardon
F-45166 OLIVET CEDEX
France
Tel: +33 (0)2 38 41 78 42
Fax: +33 (0)2 38 41 78 69
E-mail: Joel.Daroussin@orleans.inra.fr
http://www.inra.fr
http://www.orleans.inra.fr/Centre/Unite/Carto/Carto.html
Last update of this file: 19/12/94
Last update of this file: 05/02/95: minor changes and fixes.
Last update of this file: 25/03/96: version 2.0 of the project.
Last update of this file: 18/07/96: version 3.2 of the SGDE.
Last update of this file: 23/07/96: new computation of confidence levels.
SUMMARY:
--------
1 - Objective of the work:
2 - General specifications:
   2.1 - Dataset, objects, attributes, values, NODATA:
   2.2 - Rules, occurrences, input attributes, output attributes, facts,
         inferences:
   2.3 - Wild cards:
   2.4 - Confidence levels:
3 - Technical specifications:
   3.1 - Expert type rules:
   3.2 - Class type rules:
   3.3 - Other rule descriptors:
   3.4 - Dataset:
   3.5 - Naming conventions:
   3.6 - Class rule structure:
   3.7 - Expert rule structure:
   3.8 - Item coding and constraints:
   3.9 - Rules data:
4 - Project organization:
1 - Objective of the work:
-------------------------
The purpose of this work is to implement the capability to provide information
necessary to a particular field of interest (in the present case: environmental
studies) using that information which is available in another field of interest
(in the present case: description of soils). Transcription of information from
one field to the other is done by applying transfer rules (in the present case:
pedotransfer rules).
Implementation of the system will take place within the Arc/Info Geographical
Information System (GIS) software package, using its macro programming language
(AML: Arc Macro Language). The reasons for this choice are: 1) the database of
available information (soils description) is stored and managed within Arc/Info;
2) the resulting information (environmental parameters) has to be stored and
managed within Arc/Info for map display and database query purposes; and 3) this
implementation has to be made within limits of time and means that do not allow
for the acquisition of specialised software or for staff training in its use.
Although the implementation is designed for general use (within the context
stated above), it is primarily meant to provide the European Environment Agency
with spatialized environmental indicators that can be derived from the Soils
Geographical Database of Europe at Scale 1:1,000,000.
2 - General specifications:
--------------------------
2.1 - Dataset, objects, attributes, values, NODATA:
All the available information in the "from" field of interest is stored in a
so-called "dataset" (e.g. the soils dataset). The dataset is physically stored
as a dataset Info file.
The dataset holds information about a number of "objects" (e.g. a number of
soils such as Luvisols, Cambisols, etc). Each object is physically stored as a
line or record in the dataset Info data file.
The objects in the dataset have a number of characteristics called "attributes"
(e.g. soils have a soil name, a texture, etc). Each attribute is physically
stored as a column or "item" in the dataset Info file.
Each object in the dataset has a particular "value" for each one of its
attributes (e.g. Rankers have a Medium texture). Each value is physically stored
at the intersection of the object's record and the corresponding item in the
dataset Info file.
Values generally follow a coding schema before being physically stored in the
dataset (e.g. soil name Ranker is encoded and stored as U, Medium texture is
stored as 2, etc).
Some objects might not be fully described because some of their attributes are
unknown (e.g. the texture of a given soil is not known). An unknown value for an
attribute is called a "NODATA" value. As there is no pre-defined way of coding
and physically storing NODATA values in Info files, each attribute's coding
schema must make provision for a NODATA value code (e.g. 0 will mean unknown
texture).
2.2 - Rules, occurrences, input attributes, output attributes, facts,
inferences:
Soil Science experts of the project working group provide the system with
so-called "pedotransfer rules". A "rule" is the means by which new needed
information describing an object of the dataset can be derived (i.e.
"inferred"), using expert knowledge in the field, from existing available
information (i.e. factual information or "fact") describing the object (e.g. the
depth to rock of a particular soil can be inferred from its known soil name,
parent material and phase).
A set of rules holds all the usable knowledge to derive all the new needed
information from an available dataset. It is physically stored as a rules Info
database.
A rule holds all the usable knowledge to derive one single new piece of
information from a fact (available information about an object). A rule is
physically stored as a rule Info file.
A rule can be seen as a statement of the form:
IF <available information is ...> THEN <new information is ...>
ELSE IF <available information is ...> THEN <new information is ...>
...
ELSE IF <available information is ...> THEN <new information is ...>
Each line in this statement is called an "occurrence" of the rule. An occurrence
is physically stored as a line or record in the rule Info file.
An occurrence can be seen as a statement of the form:
IF (or ELSE IF)
<factual value for attribute i is w
and factual value for attribute j is x
...
and factual value for attribute n is y>
THEN
<inform the object with value z for a new attribute m>
where attributes i to n are providing the factual information (values w to y of
an object), and attribute m is providing the new - inferred - information (with
value z).
Attributes providing the factual information are called the "input attributes"
to the rule. The attribute providing the new - inferred - information is called
the "output attribute" from the rule. The input attributes are physically stored
as columns or "input items" in the rule Info file. The output attribute is
physically stored as a column or "output item" in the rule Info file.
Example:
   IF <soil name is Luvisol and parent material is "any" and phase is "any">
      THEN <depth to rock is deep>
   ELSE IF <soil name is Orthic Luvisol and parent material is Marl
      and phase is "any">
      THEN <depth to rock is very deep>
   ELSE IF <soil name is "any" and parent material is "any" and phase is Lithic>
      THEN <depth to rock is shallow>
As with the dataset, "values" are physically stored at each intersection of each
occurrence’s record and each input and output items in the rule Info file.
Input items in a rule must have the same definition (name, type, size...) and
coding schema as their corresponding item in the dataset.
An "inference" is the action of adding a new derived piece of information to an
object according to a) the available information the object provides, and b) the
rule which is activated. It proceeds in 5 steps:
1. The input attributes are identified in the rule.
2. The values for these attributes are retrieved from the object in the
   dataset and constitute a fact.
3. The occurrence of the rule which matches the fact is searched for.
4. The output attribute definition and value are retrieved from the
   matching occurrence.
5. They are added to the object in the dataset.
When a rule is activated on a dataset, one inference takes place for each
object of the dataset, one after the other. The result is a single new attribute
in the dataset, holding one new inferred value for each object.
An attribute of the dataset that has been previously inferred using a rule is
further considered as storing available information. It can thus be used as an
input attribute to other rules.
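The five inference steps, and the chaining of rules through a previously
inferred attribute, can be sketched in Python as follows. This is only an
illustration, not the AML implementation: the dataset and the rules are
hypothetical in-memory stand-ins for the Info files, the item names SOIL, PHASE,
DEPTH and ROOTING and all their values are invented for the example, and only
exact matching is shown (wild cards are left out).

```python
def infer(obj, rule):
    """Run one inference on one object of the dataset (steps 1 to 5)."""
    inputs = rule["inputs"]                                   # step 1
    fact = tuple(obj[name] for name in inputs)                # step 2
    for occurrence in rule["occurrences"]:                    # step 3
        if tuple(occurrence[name] for name in inputs) == fact:
            obj[rule["output"]] = occurrence[rule["output"]]  # steps 4 and 5
            return
    obj[rule["output"]] = ""  # no matching occurrence: the output stays blank


def activate(rule, dataset):
    """Activate a rule: one inference per object, one after the other."""
    for obj in dataset:
        infer(obj, rule)


# Hypothetical dataset and rules (names and values are illustrative only).
dataset = [
    {"SOIL": "Luvisol", "PHASE": "none"},
    {"SOIL": "Ranker", "PHASE": "lithic"},
]
depth_rule = {
    "inputs": ["SOIL", "PHASE"],
    "output": "DEPTH",
    "occurrences": [
        {"SOIL": "Luvisol", "PHASE": "none", "DEPTH": "deep"},
        {"SOIL": "Ranker", "PHASE": "lithic", "DEPTH": "shallow"},
    ],
}
# This second rule takes the previously inferred DEPTH as its input attribute.
rooting_rule = {
    "inputs": ["DEPTH"],
    "output": "ROOTING",
    "occurrences": [
        {"DEPTH": "deep", "ROOTING": "unrestricted"},
        {"DEPTH": "shallow", "ROOTING": "restricted"},
    ],
}

activate(depth_rule, dataset)
activate(rooting_rule, dataset)
```

After both activations each object carries two new attributes, DEPTH and
ROOTING, the second one derived from the first.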
2.3 - Wild cards:
It is difficult, if not impossible, for an expert to foresee all the cases that
can possibly occur in a set of available information. Furthermore, in some
cases, several, nay, many different values of a fact will lead to the same
conclusion (e.g. IF <texture is sandy or loamy or ...> THEN ...).
Therefore a "wild card" mechanism allows the expert to define occurrences of
rules which will match several different facts.
The "any" terms in the expressions of the last example above show such
situations.
The "any" wild card will be, by convention, denoted as a star character (*) in a
rule.
A fact for which an exact matching occurrence can be found will receive this
occurrence’s output attribute value.
A fact for which an exact matching occurrence cannot be found will receive the
output attribute value of the last occurrence of the rule that matches it under
the wild card convention, if there is one. This assumes that an expert builds a
rule by successive refinement, describing the most general cases before the most
particular ones.
When no matching occurrence at all can be found for a fact, no value is provided
to the output attribute, thus leaving it "blank" (or "0" (zero), depending on
the output item's type). This can lead to confusion if blank (or 0) is a
possible normal output value. Therefore, placing a fully "wild carded"
occurrence at the head of a rule will "pick up" all facts for which no
information can be provided and force the output value to, say, the NODATA
value.
Using these specifications, the above example would become:
Example:
   IF <soil name is "any" and parent material is "any" and phase is "any">
      THEN <depth to rock is unknown>
   ELSE IF <soil name is Luvisol and parent material is "any" and phase is "any">
      THEN <depth to rock is deep>
   ELSE IF <soil name is Orthic Luvisol and parent material is Marl
      and phase is "any">
      THEN <depth to rock is very deep>
   ELSE IF <soil name is "any" and parent material is "any" and phase is Lithic>
      THEN <depth to rock is shallow>
The wild card convention simulates the logical OR operator.
2.4 - Confidence levels:
Expert knowledge is fuzzy and subject to evolution. Furthermore, the available
information on the one hand, and the inferences that can be made using that
information and the expert knowledge on the other hand, both have a certain
reliability. It is therefore necessary to have a mechanism that allows each
piece of available information (or factual value) held in the dataset, and each
piece of inferred information (or output value) held in the rules database, to
be complemented with its reliability.
The reliability of a piece of information is called its "confidence level".
Confidence levels are held by confidence level attributes, one for each
attribute of the dataset, and one for the output attribute of each rule.
Therefore each object in the dataset has a confidence level value for each one
of its attributes. And each occurrence of each rule has a confidence level value
for its output attribute.
The coding schema for confidence levels is the following:
   v   Very low or no information
   l   Low
   m   Moderate
   h   High
When an inference takes place, the following 5 steps complement those listed
above:
6. The output confidence level attribute definition is retrieved from the
   matching occurrence,
7. and is added to the object in the dataset.
8. The confidence level of each input attribute of the object is determined
   from:
   . its own associated confidence level item, named <in_item>.CL in the
     dataset,
   . or, if the former is not found, the global confidence level item of
     the object, named CFL in the dataset,
   . or, if neither is found, an assumed high (h) confidence level.
9. The minimum (worst) confidence level value is retrieved from the
   confidence levels of all the attributes involved in the inference
   process (input confidence levels of the object, and output confidence
   level of the occurrence).
10. The resulting confidence level value is added to the output confidence
    level attribute in the object.
We have seen that an attribute of the dataset that has been previously inferred
using a rule can be used as an input attribute to other rules. Its confidence
level will be used in the same way as for any other input attribute.
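Steps 8 and 9 can be sketched in Python as follows. This is a hedged
illustration rather than the project's AML code: the function names and the
dict-based object are hypothetical, while the item names <in_item>.CL and CFL,
the codes v, l, m, h and the "worst level wins" computation are taken from the
text above.

```python
CL_ORDER = "vlmh"  # coding schema, from worst (v) to best (h)


def input_confidence(obj, item):
    """Confidence level of one input attribute of an object (step 8)."""
    if item + ".CL" in obj:      # its own associated confidence level item
        return obj[item + ".CL"]
    if "CFL" in obj:             # global confidence level item of the object
        return obj["CFL"]
    return "h"                   # neither is found: assume high confidence


def output_confidence(obj, input_items, occurrence_cl):
    """Worst of the input levels and the occurrence's output level (step 9)."""
    levels = [input_confidence(obj, item) for item in input_items]
    levels.append(occurrence_cl)
    return min(levels, key=CL_ORDER.index)


# Hypothetical objects: SOIL has its own level, TEXT falls back on CFL.
described = {"SOIL": "U", "SOIL.CL": "m", "TEXT": 2, "CFL": "l"}
undescribed = {"SOIL": "U", "TEXT": 2}

result_described = output_confidence(described, ["SOIL", "TEXT"], "h")
result_undescribed = output_confidence(undescribed, ["SOIL", "TEXT"], "m")
```

In the first case the low global level of the object limits the result to l;
in the second case no input level is stored, so the occurrence's own level m
is the worst one and is retained.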
3 - Technical specifications:
----------------------------
3.1 - Expert type rules:
When a rule is applied to a dataset, it is processed in the following manner:
1. Input items to the rule are located and checked in the dataset
2. and output and confidence level items are added (empty) to the dataset.
3. Then for each record in the dataset:
4. the combination of actual values for the input items is matched to its
   corresponding combination in the rule Info file,
5. the corresponding value for the output item is retrieved from the rule Info
   file,
6. the corresponding value for the output confidence level item is computed
   from all the available input confidence levels in the dataset and the output
   confidence level in the rule,
7. and finally these values are updated in the current record of the dataset.
Input and output items of a rule have a limited number of possible Info data
types. These are character (C), clear integer (I) and clear numeric (N). Any
other Info data type (date (D), binary integer (B) and binary floating point
numeric (F)) is not to be used in rule data files.
3.2 - Class type rules:
The rules described above are called "expert type rules" as opposed to "class
type rules". Class type rules are simple reclassification or recoding rules.
They are used in any of the following cases:
1) convert the Info data type of an input item in the dataset from an
unauthorized to an authorized type (e.g. B to I, or F to N),
2) reduce the number of different values for an input item (e.g. reclass
detailed texture classes into less detailed texture classes),
3) recode the values of an input item (e.g. change codes to a more "speaking"
coding schema),
4) a combination of the above cases.
Class type rules accept only one input item and produce one output item. The
input item has no limitation as to its Info data type. The output item follows
the same limitations as those applicable to expert type rules.
Class type rules do not follow the wild card convention. Wild cards may not be
used there.
Class type rules do not hold an output item associated confidence level for
their occurrences. But if the input item has an associated confidence level in
the dataset, the class type rule copies it in the dataset to a confidence level
item associated to the output item.
Thus class type rules may or may not produce a confidence level item together
with the output item, whereas expert type rules always produce a confidence
level item.
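A class type rule amounts to a one-to-one recoding table plus an optional copy
of the input confidence level. The Python sketch below is a hedged illustration,
not the project's AML code: apply_class_rule, the dict-based dataset and the
recoding values are hypothetical (the item names TYPE and CLASSTYPE are borrowed
from the class rule example of section 3.6), and giving 0 to a value with no
occurrence is an assumption made for this example.

```python
def apply_class_rule(dataset, in_item, out_item, recode, blank=0):
    """Recode in_item into out_item for every object of the dataset."""
    for obj in dataset:
        # No wild cards here: only an exact lookup of the input value.
        # Assumption: a value with no occurrence gets `blank` (0 or ' ').
        obj[out_item] = recode.get(obj[in_item], blank)
        # Copy the input confidence level, if the dataset holds one.
        if in_item + ".CL" in obj:
            obj[out_item + ".CL"] = obj[in_item + ".CL"]


# Hypothetical recoding of a 2-character TYPE code into an integer class.
recode = {"A1": 1, "A2": 1, "B1": 2}
dataset = [
    {"TYPE": "A2", "TYPE.CL": "m"},  # has an associated confidence level
    {"TYPE": "B1"},                  # has none
    {"TYPE": "Z9"},                  # value absent from the rule
]
apply_class_rule(dataset, "TYPE", "CLASSTYPE", recode)
```

Only the first object ends up with a CLASSTYPE.CL item, because it is the only
one whose input item carried a confidence level.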
3.3 - Other rule descriptors:
Each occurrence of a rule additionally carries the following:
- an author identification number,
- a last update date,
- and a pointer to a text file holding free explanatory notes that give more
  details about the occurrence (not implemented at this time).
The rules database also holds a rules information file (DICTIONARY) and an
authors information file (AUTHORS).
3.4 - Dataset:
Each time a rule is activated or "fired", the input items to the rule are
checked against the dataset. All input items of a rule must exist within the
dataset. They must have the same definition (name, type, size) in both the
dataset and the rule.
Each item in the dataset may or may not have an associated confidence level
item. When an expert rule is fired, if an input item does not have an associated
confidence level in the dataset, it is assumed to have the best confidence
level.
3.5 - Naming conventions:
A rule is an Info file stored in the $PTRHOME/xxx_rules Arc/Info workspace,
where xxx refers to the domain to which the rules apply (e.g. eur32_rules refers
to rules applicable to the Soils Geographical Database of Europe at Scale
1:1,000,000 version 3.2).
It is named RULE<rule_number> in which <rule_number> identifies uniquely the
rule in the rule database.
Each record of the rule file is called an occurrence of the rule.
Input and output items in a rule follow the Arc/Info naming conventions with one
restriction: an item name must not exceed 13 characters. (The reason is that the
3-character suffix used by the next naming convention, for associated confidence
level items, must still fit within Info's 16-character limit on item names.)
An associated confidence level item has the same name as the item to which it is
associated but is suffixed by ".CL" (e.g. if item name is ITEM then associated
confidence level item name must be ITEM.CL).
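The arithmetic behind the 13-character restriction (13 + 3 = 16) can be checked
with a small Python helper. The function name cl_item_name is hypothetical and
only illustrates the convention; it is not part of the project's AML code.

```python
MAX_ITEM_NAME = 13     # restriction stated above for rule item names
CL_SUFFIX = ".CL"      # suffix of an associated confidence level item
INFO_NAME_LIMIT = 16   # Info's limit on item name length


def cl_item_name(item_name):
    """Return the name of the confidence level item associated to an item."""
    if len(item_name) > MAX_ITEM_NAME:
        raise ValueError(
            f"item name {item_name!r} exceeds {MAX_ITEM_NAME} characters"
        )
    return item_name + CL_SUFFIX


short_name = cl_item_name("ITEM")
longest_name = cl_item_name("A" * MAX_ITEM_NAME)
```

A 13-character item name yields a 16-character confidence level item name,
which is exactly the Info limit; anything longer is rejected.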
3.6 - Class rule structure:
A class rule is one that classifies or recodes one and only one input attribute
into an output attribute.
COLUMN  ITEM NAME       WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR          2       2     I
           Identification number of author of the occurrence of the rule.
     3  LAST_UPD            8       8     D
           Last update date of the occurrence of the rule.
    11  NOTE                4       5     B
           Pathname to an ASCII explanatory note file of the occurrence of the
           rule. (Not used at this time.)
    15  <out_item>          ?       ?     ?
           Output attribute from the rule.
     ?  <in_item>           ?       ?     ?
           Input attribute to the rule. In the case of a class rule, there is
           only one input attribute.
** REDEFINED ITEMS **
     1  CLASS_RULE          2       2     I
           The name of this redefined attribute is only used to differentiate
           class type from expert type rules.
Example of a class rule to recode an attribute named TYPE to a new attribute
named CLASSTYPE:
COLUMN  ITEM NAME       WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR          2       2     I      -                      -
     3  LAST_UPD            8       8     D      -                      -
    11  NOTE                4       5     B      -                      -
    15  CLASSTYPE           1       1     I      -                      -
    16  TYPE                2       2     C      -                      -
** REDEFINED ITEMS **
     1  CLASS_RULE          2       2     I      -                      -
3.7 - Expert rule structure:
An expert rule is one that uses one or several input attributes to infer the
values of an output attribute.
COLUMN  ITEM NAME       WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR          2       2     I
           Identification number of author of the occurrence of the rule.
     3  LAST_UPD            8       8     D
           Last update date of the occurrence of the rule.
    11  NOTE                4       5     B
           Pathname to an ASCII explanatory note file of the occurrence of the
           rule. (Not used at this time.)
    15  <out_item>          ?       ?     ?
           Output attribute from the rule.
     ?  <out_item>.CL       1       1     C
           Output confidence level attribute from the rule.
     ?  <in_item 1>         ?       ?     ?
           1st input attribute to the rule.
  {  ?  <in_item 2>         ?       ?     ?
           2nd input attribute to the rule.
        ...
     ?  <in_item N>         ?       ?     ?
           Nth input attribute to the rule. }
** REDEFINED ITEMS **
     1  EXPERT_RULE         2       2     I
           The name of this redefined attribute is only used to differentiate
           expert type from class type rules.
Example of an expert rule to infer an output attribute named GEOL from a set
of 2 attributes named DEPTH and TYPE:
COLUMN  ITEM NAME       WIDTH  OUTPUT  TYPE  N.DEC  ALTERNATE NAME  INDEXED?
     1  NUM_AUTHOR          2       2     I      -                      -
     3  LAST_UPD            8       8     D      -                      -
    11  NOTE                4       5     B      -                      -
    15  GEOL                1       1     C      -                      -
    16  GEOL.CL             1       1     C      -                      -
    17  DEPTH               1       1     C      -                      -
    18  TYPE                1       1     C      -                      -
** REDEFINED ITEMS **
     1  EXPERT_RULE         2       2     I      -                      -
3.8 - Item coding and constraints:
<out_item>.CL (output attribute confidence level):
   h          High level of confidence.
   m          Medium level of confidence.
   l          Low level of confidence.
   v          Very low level of confidence or no information.
   ' '        A blank is output in the dataset for this item when no inference
              can be made for some input value(s) because the occurrence is
              missing from the rule. The blank must not appear in the rule
              (all occurrences must have an explicit confidence level).
NUM_AUTHOR
   -1         Unknown.
   1 ... 99
LAST_UPD
   01/01/00   Unknown.
NOTE
   0          No note.
   1          Note number 1 for current rule.
   ...
<out_item>
   0 or ' '   For expert rules, a 0 or a blank (depending on whether the output
              item is of numerical or character type, respectively) is output
              in the dataset when no inference can be made for some input
              value(s) because an appropriate occurrence is missing from the
              rule. If this happens together with a blank output in the
              <out_item>.CL item when running the rule, a warning is issued. It
              is thus a good idea not to use 0 or blank as output values from a
              rule, so that they cannot be confused with the case of a missing
              occurrence.
3.9 - Rules data:
A rule holds a number of occurrences. Each occurrence holds the data for the set
of input values of a fact to which corresponds an output value and a confidence
level for that fact.
A "wild card" character (* = the star character) may be used that stands for
"any value".
Example:
   IN1  IN2  OUT  OUT.CL
   *    *    n    v
   a    *    1    v
   a    x    2    h
   a    y    3    h
   *    x    4    l
   *    y    5    m
   b    *    6    v
   b    x    7    h
   b    y    8    h
If a fact has the values (IN1=a,IN2=x) then (OUT=2,CONF=h).
If fact is (IN1=a,IN2=z) then (OUT=1,CONF=v).
If fact is (IN1=c,IN2=x) then (OUT=4,CONF=l).
If fact is (IN1=c,IN2=z) then (OUT=n,CONF=v).
Notice that a fact for which an exactly matching occurrence exists is matched to
that occurrence wherever it is positioned in the rule. Therefore the order in
which occurrences with no wild card(s) appear has no significance to the
program.
On the contrary, any fact which does not have an exactly matching occurrence in
the rule is matched (if possible) to the last "wild card" matching occurrence
encountered in the rule. For example (IN1=a,IN2=z) could have been matched to
(OUT=n,CONF=v), but another match was found later in the rule (OUT=1,CONF=v),
which was retained. Therefore the order in which occurrences with wild card(s)
appear is significant to the program. The user should think of this as
describing the occurrences of a rule starting from the most general case
(IN1=*,IN2=*) and moving to the most particular case (IN1=a,IN2=x).
When a fact does not find any matching occurrence in a rule, its output is left
blank (OUT=' ',CONF=' '). Having a fully "wild carded" occurrence (IN1=*,IN2=*)
at the head of a rule will "pick up" all facts for which no information can be
provided, for example (IN1=c,IN2=z). In this example the user controls the
output by providing a "no data" value (OUT=n,CONF=v), instead of letting the
program leave it blank (OUT=' ',CONF=' ').
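The matching behaviour described in this section can be sketched in Python and
checked against the four facts of the example. This is an illustration under
stated assumptions, not the AML implementation: the tuple layout and the
function names find_occurrence and infer are hypothetical, and the combination
of the occurrence's confidence level with the input confidence levels of the
dataset is left out.

```python
WILD = "*"

# Occurrences of the example rule, as (IN1, IN2, OUT, OUT.CL).
RULE = [
    ("*", "*", "n", "v"),
    ("a", "*", "1", "v"),
    ("a", "x", "2", "h"),
    ("a", "y", "3", "h"),
    ("*", "x", "4", "l"),
    ("*", "y", "5", "m"),
    ("b", "*", "6", "v"),
    ("b", "x", "7", "h"),
    ("b", "y", "8", "h"),
]


def find_occurrence(fact, occurrences):
    """Exact match wherever it is; otherwise the last wild card match."""
    last_wild = None
    for occurrence in occurrences:
        pattern = occurrence[:len(fact)]
        if pattern == fact:
            return occurrence  # the position of an exact match is irrelevant
        if all(p == WILD or p == v for p, v in zip(pattern, fact)):
            last_wild = occurrence  # a later wild card match replaces this one
    return last_wild


def infer(fact, occurrences):
    """Return (OUT, OUT.CL) for a fact, or blanks when nothing matches."""
    occurrence = find_occurrence(fact, occurrences)
    if occurrence is None:
        return (" ", " ")
    return occurrence[len(fact):]


results = {fact: infer(fact, RULE)
           for fact in [("a", "x"), ("a", "z"), ("c", "x"), ("c", "z")]}
# Same rule without its fully wild carded head occurrence.
blank_result = infer(("c", "z"), RULE[1:])
```

With the head occurrence removed, the fact (IN1=c,IN2=z) no longer matches
anything and its output is left blank, which is the situation the fully
"wild carded" head occurrence is meant to avoid.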
4 - Project organization:
------------------------
This project is independent from any other. This means that any data necessary
to this project is copied into the project's directory (e.g. the Soils
Geographical Database of the European Union version 2.1).
The main objects that can be found under the project's directory are:
   PTRDBE_Readme     A "first things first" short read-me file.
   PTRDBE_Metadata   Overview of the subject.
   PTRDBE_Specif     Project specifications.
   PTRDBE_dictiona   Project's dictionary.
   Rules_xxx         The Pedotransfer Rule number xxx.