As in other scientific disciplines, paleontology advances through the discovery of new information and the modification of existing knowledge. The process of discovery generates large volumes of data that can be overwhelming if not properly stored and/or utilized. For example, the treatise on invertebrate macrofossils edited by Raymond in 1959 blazed the trail for similar works that came later. Many paleontological volumes provide information on fossil specimens that have been formally named.

In palynology, problems can arise with palynomorph classifications and interpretations because they are subjective, depending on human judgment and different levels of training. As a result, the same palynomorph can be interpreted or classified differently, resulting in junior synonyms and emended descriptions that can potentially confuse students and new researchers. It is important to provide a framework to compose a standardized description of each taxon utilizing diverse observations from various taxonomists.

Description #1: Proximate, acavate cyst, large (75-95 µm) with irregular perforated parasutural crests, well developed paratabulation (including parasulcal paraplates), and precingular archaeopyle.

Description #2: Large proximate, cyst with irregular perforated crests, well developed paratabulation.
Expert involvement in creating start/stop word list can improve accuracy of results
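As a minimal sketch of how a start/stop word list might be applied, the Python snippet below drops stop-list terms and, when a start list is supplied, keeps only start-list terms. The function names and the word lists are assumptions for illustration; the sample text is Description #2, and an expert-curated list would replace the hypothetical one.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_terms(tokens, stop_list=None, start_list=None):
    """Drop stop-list terms; if a start list is given, keep only its terms."""
    stop = set(stop_list or [])
    kept = [t for t in tokens if t not in stop]
    if start_list is not None:
        start = set(start_list)
        kept = [t for t in kept if t in start]
    return kept

description = ("Large proximate, cyst with irregular perforated crests, "
               "well developed paratabulation.")
stop_words = ["with", "well"]  # hypothetical stop list
terms = filter_terms(tokenize(description), stop_list=stop_words)
print(terms)
```

A stop list removes uninformative words, while a start list restricts the analysis to a controlled vocabulary; which one suits a given taxon group is a judgment an expert would make.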
[Figure* labels: Ifecysta sp. 1, Ifecysta sp. 2, Diphyes sp., Achomosphaera sp.]
The main objective of this study is to propose a framework that utilizes text mining techniques in developing a taxon description recommendation system. Text mining can apply intelligent methods/algorithms to extract or mine knowledge and meaningful data patterns from a large amount of unstructured texts or documents for decision-making. Therefore, it is expected that common characteristics and features from interpretations done by different scholars can be captured and used for standardized descriptions to minimize the issue of subjective human judgment.
Descriptive terms can be used for:
(1) determining fundamental dimensions of a taxon group
(2) finding a target dinocyst
(3) clustering
[Figure labels: Diphyes sp., Achomosphaera sp.]
* COL1, COL2 & COL3 represent the principal components (SVD variables) of the descriptive terms
[Figure: Four different descriptions for the same dinocyst]

By analyzing different descriptions composed by various scholars, a list of descriptive terms can be generated and used to develop a more complete (standardized) description for an existing or a new dinocyst. As a result, the subjective nature of dinocyst description (human judgment or level of training) can be minimized.
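One way to picture how a term list could be drawn from several descriptions is to count, for each term, how many descriptions use it. The sketch below does this for Description #1 and Description #2; the stop list, the three-letter minimum, and all names are assumptions for illustration, not the poster's actual procedure.

```python
import re
from collections import Counter

descriptions = [
    "Proximate, acavate cyst, large (75-95 µm) with irregular perforated "
    "parasutural crests, well developed paratabulation (including parasulcal "
    "paraplates), and precingular archaeopyle.",
    "Large proximate, cyst with irregular perforated crests, well developed "
    "paratabulation.",
]

STOP_WORDS = {"with", "and", "well", "including"}  # hypothetical stop list

def term_set(text):
    """Distinct lowercase terms of three or more letters, minus stop words."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

# Number of descriptions in which each term appears.
doc_freq = Counter()
for d in descriptions:
    doc_freq.update(term_set(d))

shared = sorted(t for t, n in doc_freq.items() if n == len(descriptions))
unique = sorted(t for t, n in doc_freq.items() if n == 1)
print("shared:", shared)
print("unique:", unique)
```

Terms shared by every description are candidates for the core of a standardized description, while terms used by only one author are candidates for expert review.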
[Flowchart] DinoSys Sample Database; DinoSys; CHRONOS; User Input; …
[Flowchart modules] Text Preprocessing & Pattern Identification Module; Variable Reduction Module; Clustering Module; Model Creation Module; Model Evaluation Module; Model Selection Module
Finding hidden textual patterns for potential technology by grouping similar descriptions
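Grouping similar descriptions can be sketched in many ways, and the poster does not name a clustering algorithm. The example below swaps in a deliberately simple one: single-link grouping on Jaccard similarity of term sets. The term sets, the 0.5 threshold, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_similar(term_sets, threshold=0.5):
    """Single-link grouping: descriptions whose similarity meets the threshold,
    directly or through a chain of neighbours, end up in the same group."""
    n = len(term_sets)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(term_sets[i], term_sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical term sets, one per description.
term_sets = [
    {"proximate", "cyst", "perforated", "crests", "paratabulation"},
    {"proximate", "cyst", "perforated", "crests"},
    {"chorate", "cyst", "processes", "trifurcate"},
    {"chorate", "cyst", "processes", "bifurcate"},
]
groups = group_similar(term_sets)
print(groups)
```

In the framework itself the grouping would presumably operate on the reduced SVD variables rather than on raw term sets; the sketch only illustrates the idea of grouping by shared vocabulary.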
Data (Documents) Cleansing
Text processing (parsing, stop words and start words, parts of speech, stemming, synonyms, jargon, and abbreviations)
Term Frequency Matrix
Weighting Scheme
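The term frequency matrix and weighting steps can be sketched as follows. The poster does not say which weighting scheme is used, so TF-IDF (raw count times log of inverse document frequency) is assumed here purely as a common example; the tokenized documents and function names are hypothetical.

```python
import math
from collections import Counter

def term_frequency_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    vocab = sorted({t for doc in docs for t in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix

def tfidf(matrix):
    """Weight each raw count by log(N / document frequency of the term)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Every vocabulary term occurs in at least one document, so df is never 0.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

# Hypothetical descriptions, already tokenized and filtered.
docs = [
    ["proximate", "cyst", "perforated", "crests"],
    ["large", "proximate", "cyst"],
    ["chorate", "cyst", "processes"],
]
vocab, matrix = term_frequency_matrix(docs)
weights = tfidf(matrix)
print(vocab)
print(matrix)
```

Under this scheme a term that appears in every description, such as "cyst" here, receives zero weight, while a term confined to one description receives the largest weight.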
Variable Reduction & Transformation
SVD (Singular Value Decomposition): Principal components decomposition.
Roll Up Term (use the n terms with the largest term weights)
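The roll-up option is the simpler of the two reductions: it keeps only the columns for the n highest-weighted terms. The sketch below shows one reading of that step; the vocabulary, counts, weights, and function name are hypothetical, and it does not illustrate the SVD alternative.

```python
def roll_up_terms(vocab, matrix, term_weights, n):
    """Keep only the n columns whose terms carry the largest weights.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(range(len(vocab)),
                    key=lambda j: (-term_weights[j], vocab[j]))
    keep = sorted(ranked[:n])  # restore the original column order
    reduced_vocab = [vocab[j] for j in keep]
    reduced_matrix = [[row[j] for j in keep] for row in matrix]
    return reduced_vocab, reduced_matrix

# Hypothetical vocabulary, term counts, and per-term weights.
vocab = ["chorate", "crests", "cyst", "large", "proximate"]
matrix = [
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
]
term_weights = [1.10, 1.10, 0.0, 1.10, 0.41]

reduced_vocab, reduced_matrix = roll_up_terms(vocab, matrix, term_weights, n=2)
print(reduced_vocab)
print(reduced_matrix)
```

Unlike SVD, which builds new composite variables, roll-up keeps variables that are still individual terms and therefore remain directly interpretable by a taxonomist.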
Input (independent variable): The SVD variables
Output (dependent variable): Clusters
Descriptive models: Regression, Neural network, Decision tree, etc.
Model Evaluation & Model Selection
[Flowchart labels: Modeling Module, Descriptive Model, Standardized Taxon Recommendation]
The descriptive terms identified from the clustering analysis during the text mining process provide a collection of terms that are commonly used in the descriptions for the samples in the respective cluster.
1. Those descriptive terms are analyzed to determine fundamental dimensions for a taxon group.
2. Then, those descriptive terms and additional information gathered during the investigation process, with or without human intervention, are used to suggest a basic standard lexicon from which domain experts can develop a standardized taxon description recommendation.
Test the proposed framework using the Dinosys Database (permission has been granted)
Improve result accuracy with expert intervention in creating the start/stop word list
Develop a text mining enhanced search engine that takes free-form text input (description, terms, key words, etc.) and then intelligently interacts, interprets, and translates a user's search intention into suggestions and recommendations
Incorporate image and pattern matching to improve the efficiency of dinocyst search
The authors would like to express their gratitude to Dr. Jan Willem Weegink (NGTO, The Netherlands) and the Dinosys group for permission to use the Dinosys database to test our ideas, to Dr. Lucy Edwards at USGS for her insightful suggestions and encouragement, to Dr. Cinzia Cervato at Iowa State University for her expert critique, and to Dr. Martin Head at Brock University, Canada for his support.