[#JS-226] Robinson's morphology is not indexed in JSword modules

Created: 10/Jul/12 Updated: 08/Mar/13
Status: Reopened
Project: JSword
Component/s: o.c.jsword.index
Affects Version/s: 1.6
Fix Version/s: 1.7
Type: Bug
Priority: Critical
Reporter: Chris Burrell
Assignee: DM Smith
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Attachments: Encoding.doc
Description
Lucene is not told to index the morphology information rendering such searches impossible.
Comments
Comment by Chris Burrell [ 25/Oct/12 ]
See pull request = Doesn't look right on github, but I'm pretty sure I selected the branch. Perhaps
it's because it's branched off something that's ahead of the jsword...
Comment by Chris Burrell [ 06/Feb/13 ]
This has been pulled and merged: https://github.com/crosswire/jsword/pull/11 and
https://github.com/crosswire/jsword/pull/10
Comment by Chris Burrell [ 06/Feb/13 ]
Now in the build. See OSISUtil.getMorphologiesWithStrong() and
LuceneIndex.FIELD_MORPHOLOGY
Comment by DM Smith [ 09/Feb/13 ]
The entry now is something like G2345@robinson:T-NPN
There should not be robinson: in it.
While I like the functionality of G2345@T-NPN, we need to think about this. Do we want to
have both strong: and morph: ? Will just one suffice?
If we have just one do we allow strong: and morph: searches? Or do we have a new name, e.g.
lex: ?
Comment by Chris Burrell [ 09/Feb/13 ]
I think we do want to have both. One allows for exact matches on a strong number (strong: )
which is presumably faster than a trailing wildcard search. The morphology will most likely be
using wildcards most of the time (e.g. all adjectives for this word).
Comment by Chris Burrell [ 21/Feb/13 ]
So the question is: do we not want to have "robinson:" in the indexed field? If codes clash
(unlikely I know), we wouldn't want to return the wrong results for some modules.
The clash is less unlikely as soon as you introduce wildcards, e.g. G1000@ * A * (ignore spaces
- JIRA changed to bold otherwise) or whatever the code is going to be. A could mean adverb.
If we are going to change it, then what is your suggestion?
morphField.substring(morphField.indexOf(':') + 1)
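A minimal sketch of how that expression behaves, wrapped in a hypothetical helper (the class and method names are illustrative only and are not JSword API). It relies on indexOf returning -1 when there is no colon, so an unprefixed code passes through unchanged:

```java
final class MorphPrefixSketch {
    /** Returns the code without its "scheme:" prefix; values without a colon are returned unchanged. */
    static String stripPrefix(String morphField) {
        // indexOf returns -1 when there is no ':', so "+ 1" gives 0 and the whole string is kept.
        return morphField.substring(morphField.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("robinson:T-NPN"));     // prefixed Robinson code
        System.out.println(stripPrefix("strongMorph:TH8799")); // prefixed OT placeholder code
        System.out.println(stripPrefix("T-NPN"));              // already bare
    }
}
```

Note that dropping the prefix also drops the information about which scheme a code came from, which is the clash concern raised above.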
Comment by DM Smith [ 21/Feb/13 ]
We don't want robinson: any more than we want strong: w/in the other field.
Same with strongMorph: used in the OT.
BTW, the whole use of the morph field is not very useful. The least significant value of the field
is in the most significant position. Wildcard searches won't be very helpful.
I think this (how to make searching useful) is a good topic for sword-devel.
Comment by Chris Burrell [ 21/Feb/13 ]
I disagree in terms of wildcards. You need to use wildcards to search for all adjectives, or all
plurals.
It's a perfectly good thing to do:
- search for all verbs in the passage (regardless of their strong numbers)
- or search for all the times a particular word is used as a verb
I use the morph field quite successfully in STEP (although in the browser, rather than in a
search), to highlight plurals in bold, singular normal font. And then I highlight masculines and
feminines in blue / red. I use regular expressions, but a carefully crafted query using wildcards
could equally do the job.
Comment by Chris Burrell [ 21/Feb/13 ]
For the first of the two points, you'd obviously be looking for terms rather than verses. The second
is easy enough to do by looking for verses and using a custom highlighter or something like that.
But the key thing here is that STEP will have a feature where we allow the user to choose a word
(by strong number effectively) and then select which grammar points he wants to search across.
Then give him the resulting verses, highlighting the strong number.
So a user may want mercy, used as a noun and in the singular mode.
Comment by DM Smith [ 21/Feb/13 ]
Really? How would you search for adverbs using wildcards? Wouldn't that also pull back
adjectives?
How do you find plural verbs? How about first person voice (this one is tough as it is the
unmentioned default)?
How do you find all uses of a word as a verb in aorist form?
Comment by David Instone-Brewer [ 26/Feb/13 ]
Robinson codes are unnecessarily difficult to search, and it would be possible to design
something easier for computer searches, but they are consistent in their complexity and can be
searched by RegEx.
The key to them is to distinguish the following types of words:
ALL               X - Tense Voice Mood - Person Case Number Gender (- Extra)
Participles       X - Tense Voice Mood -        Case Number Gender (- Extra)
Other Verbs       X - Tense Voice Mood - Person      Number        (- Extra)
1+2 Pronouns      X -                    Person Case Number        (- Extra)
Refl+Poss Pn      X -                    Person Case Number Gender (- Extra)
Other Pronouns    X -                           Case Number Gender (- Extra)
Art, Nouns + Adj  X -                           Case Number Gender (- Extra)
(Uggg - the layout of this is lost - so line up the columns in your head)
In RegEx:
Verbs [V][123][NADG][SP][MFN][\w-]*
Others [V]-[123][NADG][SP][MFN][\w-]*
CONJ, PRT etc [A-Z][A-Z][A-Z][\w-]*
For searches this can be summarised as:
1) Participles
ie any that start with "V-" and include "P-"
2) Other verbs
ie any that start with "V-" but no "P-"
3) Nouns, articles, adjectives and general pronouns
ie starting "N-", "A-", "P-" followed by a letter, "D-"
4) Person Pronouns (1st+2nd+Reflex.+Poss)
ie like nouns but add Person (ie a number) after the first hyphen (and sometimes have Gender at
the end)
5) Unparsed words
ie ADVerbs, PARTicles, PREPositions - ie anything that doesn't have a hyphen as the second
character
Rules for searches:
Tense is the 1st letter after "V-"
Voice is the 2nd letter after "V-"
Mood is the 3rd letter after "V-"
Person is a number after "^V-"
Case is the 1st letter after "^V-"
Number is any S or P after "^V-"
Gender is the 3rd letter after "^V-"
(where "^V" is any letter except V)
RegEx examples:
Aorist tense = V-[2]A[\w-]
all tenses = V-[2][PIFARLX][\w-]
Active voice = V-[2][A-Z]A[\w-]
all voices = V-[2][A-Z][AMPEDONQX][\w-]
iNfinitive mood = V-[2][A-Z][A-Z]N[\w-]
all moods = V-[2][A-Z][A-Z][ISOMNPR][\w-]
Second person = [\w-][^V]-2[\w-]
all persons = [\w-][^V]-[123][\w-]
Nominative case = [\w-][^V]-\d*N[\w-]
all cases = [\w-][^V]-\d[NADG][\w-]*
Plural number = [\w-][^V]-\w*P[\w-]
all numbers = [\w-][^V]-\w[SP][\w-]*
Neuter gender = [\w-][^V]-\d[A-Z][A-Z]N[\w-]*
all genders = [\w-][^V]-\d[A-Z][A-Z][MFN][\w-]*
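The positional rules listed above (Tense, Voice and Mood as the three letters after "V-"; Case, Number and Gender as the three letters after the hyphen for non-verbs) can be sketched in Java roughly as follows. This is an illustration under assumptions, not the expressions from the list: the class and method names are hypothetical, the optional "2" after "V-" is taken from the "[2]" in the examples, only the plain "X - Case Number Gender" row is handled for non-verbs (person pronouns are left out), and the verb codes in the sample data are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RobinsonRegexSketch {
    // A null slot means "any capital letter in this position".
    private static String slot(Character c) {
        return c == null ? "[A-Z]" : Pattern.quote(c.toString());
    }

    /** Verbs: "V-", an optional "2", then Tense, Voice, Mood, then an optional "-..." tail. */
    static Pattern verb(Character tense, Character voice, Character mood) {
        return Pattern.compile("V-2?" + slot(tense) + slot(voice) + slot(mood) + "(-[\\w-]*)?");
    }

    /** Declined non-verbs of the form X-CaseNumberGender, e.g. N-NSF or T-GPM. */
    static Pattern nominal(Character grammaticalCase, Character number, Character gender) {
        return Pattern.compile("[A-UW-Z]-" + slot(grammaticalCase) + slot(number) + slot(gender)
                + "(-[\\w-]*)?");
    }

    /** Keeps only the codes that match the whole pattern. */
    static List<String> filter(List<String> codes, Pattern pattern) {
        return codes.stream().filter(c -> pattern.matcher(c).matches()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> codes = Arrays.asList("V-AAI-3S", "V-2AAI-3S", "V-PAP-NSM", "N-NSF", "T-GPM", "PREP");
        System.out.println(filter(codes, verb('A', null, null)));    // aorist verbs
        System.out.println(filter(codes, verb(null, null, 'P')));    // participles
        System.out.println(filter(codes, nominal(null, 'P', null))); // plural non-verbs
    }
}
```

Matching with matches() anchors the pattern at both ends, so no explicit "^" or "$" is needed.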
Comment by DM Smith [ 28/Feb/13 ]
Looking at the code now. There's something about it that isn't quite right.
In the OT we don't have morphology codes that have semantic meaning. In fact, the codes
(strongsMorph aka "thayers") that are used are placeholders for the future. We don't have a
dictionary for those codes. I'd love it if David I-B, Daniel Owens and Chris Little would work
together to come up with a meaningful code pattern that can be used. Chris has talked about a
code that would not require an external dictionary. As long as we have a mapping from existing
(robinson and "thayer") codes to the new one, I'll do the work in the KJV to update it.
But regarding the code, we should put the morph codes into the searchable morph field
regardless of whether they are Robinsons or strongsMorph.
The form of SN@MC (Strong's Number@Morph Code) is useful if the Morph Code has
semantic meaning (as above).
The other thing, is that the lemma, morph and src attribute values form parallel arrays. Here are
3 examples from the upcoming KJV of the longest ones:
<w src="2 3 4 5" lemma="strong:G3956 strong:G3739 strong:G5100 strong:G302 tr:παν tr:ο
tr:τι tr:αν" morph="robinson:A-ASN robinson:R-ASN robinson:X-ASN
robinson:PRT">whatsoever</w>
<w src="15 16 17 18" lemma="strong:G3588 strong:G4286 strong:G3588 strong:G740 tr:η
tr:προθεσις tr:των tr:αρτων" morph="robinson:T-NSF robinson:N-NSF robinson:T-GPM
robinson:N-GPM">the shewbread</w>
<w src="10 13 14 15" lemma="strong:G3756 strong:G1519 strong:G3588 strong:G165 tr:ουκ
tr:εις tr:τον tr:αιωνα" morph="robinson:PRT-N robinson:PREP robinson:T-ASM robinson:N-ASM">never</w>
It is not a valid assumption that they always form parallel arrays. It took a lot of effort to clean
up the KJV NT to make it so. Here is an example from the KJV OT:
<w morph="strongMorph:TH8799 strongMorph:TH8675 strongMorph:TH8686"
lemma="strong:H05749 strong:H05749">What thing shall I take to witness</w>
There's no reason to think other modules are error free. See an example from ABP here:
http://www.crosswire.org/tracker/browse/MOD-245
This has Strong's Numbers but they are pretty goofy.
So I think we should do the following:
Have one or more fields with SN, MC and SN@MC.
The construction of SN@MC needs to be best effort, pairing the first Strong's Number with the
first morph code, the second with the second and so forth until no more pairs can be formed.
The question I have is do we really need more than one field if the codes are not overlapping?
And I wonder about the - in the code. I was under the impression that the - formed a word break.
It seems to me that we need to normalize the code so that it can be searched well.
Similar question regarding @ does it get stripped out by Lucene? Do we need to protect it?
I'm inclined to say that the easiest path is 3 fields: strong:SN, morph:MC and lex:SN@MC.
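The best-effort pairing described above (first Strong's Number with first morph code, and so on until one list runs out) might look like this. It is only a sketch: the class and method names are hypothetical, it works on raw attribute strings rather than on JSword's OSIS handling, and it assumes the prefixes "strong:", "robinson:" and "strongMorph:" seen in the examples.

```java
import java.util.ArrayList;
import java.util.List;

final class LexPairingSketch {
    /** Splits an attribute value on whitespace and keeps the codes carrying one of the prefixes. */
    static List<String> codes(String attribute, String... prefixes) {
        List<String> out = new ArrayList<>();
        if (attribute == null) {
            return out;
        }
        for (String token : attribute.trim().split("\\s+")) {
            for (String prefix : prefixes) {
                if (token.startsWith(prefix)) {
                    out.add(token.substring(prefix.length())); // drop "strong:", "robinson:", ...
                    break;
                }
            }
        }
        return out;
    }

    /** Pairs the i-th Strong's Number with the i-th morph code until either list is exhausted. */
    static List<String> pair(String lemma, String morph) {
        List<String> strongs = codes(lemma, "strong:"); // "tr:" entries are ignored
        List<String> morphs = codes(morph, "robinson:", "strongMorph:");
        int pairs = Math.min(strongs.size(), morphs.size());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < pairs; i++) {
            out.add(strongs.get(i) + "@" + morphs.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // KJV OT example from above: three morph codes but only two lemmas.
        System.out.println(pair("strong:H05749 strong:H05749",
                "strongMorph:TH8799 strongMorph:TH8675 strongMorph:TH8686"));
    }
}
```

With the OT example the third morph code is simply left unpaired; whether it should still go into the plain morph field is a separate decision.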
Comment by David Instone-Brewer [ 04/Mar/13 ]
There's a few different issues here:
A) Greek OT:
1) The ABP problems with OT Greek codes (examples at MOD-245) are not easy to fix,
because this is a copyright text.
2) The problems with OT Greek vocabulary stem from the fact that the OT has words which
aren't in Strong's lexicon, cos this was made for the NT. Ideally the lexicon should be extended
in a way that is backwardly compatible - perhaps like the NASB does for Hebrew.
3) There is no OT Greek morphology in the ABP module (or any other OSIS module as far as I
know).
STEP has permission to use the CCAT data for Greek OT lexicon and morphology (I'm not sure
if we needed to ask, but we did). Implementing this is not straightforward, but it will happen.
B) Hebrew OT:
1) The Hebrew morphology codes are fairly straightforward, but not full.
They only tell us the Stem and Mood for verbs, and supply no information about nouns,
particles etc.
2) There are some additional codes marked as morphology which are actually notes (such as "the
Qere is followed here instead of the Ketiv")
3) The KJV module, as coded at present, cannot be used to create arrays, because there are
frequently more lemmas than morphology codes in the same tag, with no indication as to which
morphology goes with which lemma. This is due to problems (1) and (2).
4) About 6000 words in the OT have no lemma tags.
These issues are now all fixed - the work was done by the Open Scripture guys and myself - and
I'm in the process of checking it.
It would be great to integrate this into the KJV module.
Hebrew Morphology:
The Open Scripture guys have a very well thought-out system of morphology codes and
are creating a crowd-sourcing tool to get this implemented. However, this is probably a very
long-term project.
I am in the process of adding a much more limited set of morphology codes in the mean-time, ie
Stem, Mood, person, number, gender - for all words.
The codes will not be perfect, but they will supply something while the Open Scripture
morphology is being developed.
Question:
Given that the Robinson codes are 'standard', do you think the OT Hebrew codes should try to
emulate them, or should we start again with something which is easier to search and process?
I was thinking of something where every word would have the same morphology layout,
eg: VQi3SF meaning Verb: Qal Imperative 3rd singular feminine
or: N---PM meaning Noun: plural masculine
The idea is that each morphology would have standard place markers for Stem, Mood, person,
number, gender
so (for example) plurals would always have "P" as the 5th character.
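To make the fixed-layout idea concrete, here is a small sketch that reads a slot by position. It assumes the six-character layout implied by the two examples (Type, Stem, Mood, person, number, gender, with "-" as the placeholder); the class, enum and method names are hypothetical.

```java
final class FixedLayoutSketch {
    /** Slots of the proposed layout; each ordinal is the character index within the code. */
    enum Slot { TYPE, STEM, MOOD, PERSON, NUMBER, GENDER }

    /** Returns the character in the given slot, or '-' when the code is too short to have it. */
    static char slot(String code, Slot slot) {
        int index = slot.ordinal();
        return index < code.length() ? code.charAt(index) : '-';
    }

    /** Plurals always carry 'P' as the 5th character, whatever the word type. */
    static boolean isPlural(String code) {
        return slot(code, Slot.NUMBER) == 'P';
    }

    public static void main(String[] args) {
        System.out.println(isPlural("N---PM")); // noun, plural masculine
        System.out.println(isPlural("VQi3SF")); // verb, 3rd singular feminine
    }
}
```

The appeal of this layout is that every test is a single charAt call with no regular expression at all; the cost is the run of placeholder dashes.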
Comment by DM Smith [ 04/Mar/13 ]
I'm not sure I'm the best to answer the OT Hebrew Codes question. I took a 7 week crash course
in Hebrew, but I don't remember enough of it to weigh in. I'd defer to Chris L, our resident
linguist, or to you or some other Hebrew scholar.
But I'll give an end user kind of answer:
The pattern should serve two kinds of user well: the casual user who needs to decode it at a glance and
the advanced user who will form search requests.
For someone that has passing knowledge of Greek, the Robinson codes are fairly easy to learn
and quick to read. As we allow the user to show/hide these values, this is important. I'd want the
same for people with a working knowledge of Hebrew.
If I were to pick, I would not like to have to count to see that the P is the 5th character and that for
nouns, I have to see 3 dashes. A dash as a separator works fine visually, but not as a
placeholder. Some fonts will show it as a single long dash. So I'd suggest a different order, such
that there would be as few dashes as possible. No need to have trailing dashes. And perhaps a /
or another character that visually will never join an adjacent character and will not need special
processing when showing in HTML/XML.
The other part is that the end user will perhaps be forming search queries manually. In lucene,
the don't care matcher for a single character is a dot (or maybe it is a question mark. I forget.) So
the user would type 'N...PM' to search. Having a different order would give NPM* or perhaps
just NPM, if we don't right pad the code.
But that's just a first response, I might find that I quickly adjust to 3 dashes in the middle.
One of the basic human user interface principles is to order the material from the most useful to
the least useful, from left to right, from top down. I think such would apply to this as well
(though not top down).
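On the wildcard point: in Lucene's classic query syntax the single-character wildcard is "?" and "*" matches zero or more characters, so the padded form would be typed N???PM. The sketch below imitates those two wildcards with java.util.regex so the two orderings can be compared without an index. It does not call Lucene, ignores analyzers and escaping, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    /** Translates a wildcard expression ("?" = one character, "*" = any run) into a regex. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // treat everything else literally
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True when the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("N???PM", "N---PM")); // padded layout needs three placeholders
        System.out.println(matches("NPM*", "NPM"));      // reordered layout needs only a prefix
    }
}
```

A prefix query such as NPM* is also typically cheaper for Lucene to evaluate than a pattern with wildcards in the middle, which is another argument for putting the most useful fields first.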
Comment by David Instone-Brewer [ 05/Mar/13 ]
Great idea about changing the order.
Nouns have number, gender
Particles usually have nothing
verbs can have everything
Adjectives may have gender and number
Adverbs usually have nothing
Object markers have number and perhaps gender
Pronouns have number, gender & person
They all have a Type (ie 'Noun','verb' etc)
And they all have a Language (Hebrew, Aramaic or personal Name)
So to minimise place markers the order would be:
Language, Type, Number, Gender, Person, and then verbal details.
For verbs, we could simply record Stem (eg Qal, Hiphil etc)
but there are 37 of these in Strong's system, and more in others,
and it is more meaningful to interpret the Stems as Voice and Mood
So, we end up with:
Language, Type, Number, Gender, Person, Voice, Mood, State/Aspect
eg
HNSM = Hebrew Noun Singular Masculine
Aa = Aramaic adverb
NLSF = Name of Location Singular Feminine
HVPM2PIC = Hebrew Verb Plural Masculine 2nd person Passive Intensive Consecutive
imperfective
I've attached a table of all the abbreviations ("Encoding.doc")
Comment by DM Smith [ 06/Mar/13 ]
Regarding using regular expressions to do a search:
Lucene search syntax is not regular expression. It is more like unix command-line globbing. I
haven't seen regular expression support in a contrib to Lucene, but that doesn't mean it is not
there.
But if not, to support regular expressions, we'll need to intercept the query and pick out the
regular expression and use the regular expression to do our own search over our own store or the
term dictionary.
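A rough sketch of that interception step, under stated assumptions: the "field:/regex/" clause syntax is a made-up convention for this example, the term dictionary is stood in for by a plain collection of strings, and the class and method names are hypothetical. The matching terms could then be OR'ed into an ordinary query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexQuerySketch {
    // Hypothetical convention: a clause such as morph:/N-.S./ carries a regex between slashes.
    // Regexes that themselves contain '/' are not supported by this sketch.
    private static final Pattern CLAUSE = Pattern.compile("(\\w+):/([^/]+)/");

    /** Finds regex clauses for the given field and returns the dictionary terms they match. */
    static List<String> expand(String query, String field, Collection<String> termDictionary) {
        List<String> hits = new ArrayList<>();
        Matcher clause = CLAUSE.matcher(query);
        while (clause.find()) {
            if (!clause.group(1).equals(field)) {
                continue; // clause belongs to another field
            }
            Pattern regex = Pattern.compile(clause.group(2));
            for (String term : termDictionary) {
                if (regex.matcher(term).matches()) {
                    hits.add(term);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("N-NSF", "N-GPM", "T-NSF");
        System.out.println(expand("mercy morph:/N-.S./", "morph", terms)); // singular nouns
    }
}
```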
Comment by David Instone-Brewer [ 07/Mar/13 ]
The RegEx expressions were more complicated than I had thought they would be.
Is it time to redesign the Robinson Codes?
They aren't particularly human-friendly or machine-friendly
I think the latter is more important because ideally people won't see the actual coding.
Comment by Chris Burrell [ 07/Mar/13 ]
Agreed - showing the codes to the user, should be a last resort thing, as it implies that they need
to learn the new system.
Comment by Chris Burrell [ 07/Mar/13 ]
It seems Lucene has some support for Regular Expressions anyway:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/contrib-regex/org/apache/lucene/search/regex/package-summary.html
Comment by DM Smith [ 08/Mar/13 ]
If we can create a mapping for Robinson codes to something that is better (human readable and
easy to search), then we can use the mapping w/in JSword to provide a better user experience.
Basic thought, the user would see the new codes or a decoding of these codes into their
language (or the default, if there's no such translation). They can search these codes either
directly or via a wizard (what is done would be a front-end choice).
It may be that the underlying module uses the old codes. That'd be ok. Not ideal. The search
would reverse the mapping going from the new codes to the old codes and use that to search the
module. Likewise, when presenting the module, the old codes would be replaced with the new
codes. This would be a process of normalization, which we do currently for Strong's numbers.
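As an illustration of the two-way normalization, a minimal sketch with a pair of maps. No new code scheme has been agreed, so the "new" codes in the example are invented placeholders, and the class and method names are hypothetical rather than anything in JSword.

```java
import java.util.HashMap;
import java.util.Map;

final class CodeMappingSketch {
    private final Map<String, String> oldToNew = new HashMap<>();
    private final Map<String, String> newToOld = new HashMap<>();

    /** Registers one old/new pair in both directions. */
    void add(String oldCode, String newCode) {
        oldToNew.put(oldCode, newCode);
        newToOld.put(newCode, oldCode);
    }

    /** Used when presenting a module: old code in, new code out (unmapped codes pass through). */
    String forDisplay(String oldCode) {
        return oldToNew.getOrDefault(oldCode, oldCode);
    }

    /** Used when searching: new code in, old code out (unmapped codes pass through). */
    String forSearch(String newCode) {
        return newToOld.getOrDefault(newCode, newCode);
    }

    public static void main(String[] args) {
        CodeMappingSketch mapping = new CodeMappingSketch();
        mapping.add("N-NSF", "noun/nom/sg/fem"); // placeholder "new" code, not a proposal
        System.out.println(mapping.forDisplay("N-NSF"));
        System.out.println(mapping.forSearch("noun/nom/sg/fem"));
    }
}
```

This only covers exact codes; wildcard or regex searches over the new codes would need the pattern itself translated, or the index built on the new codes.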
We may want to explore the idea of a module sidecar. On various occasions, I've wanted finer
grain information regarding a module. Basically, we'd maintain a separate conf for the modules.
It'd contain information regarding things like: user provided font info, unlock keys, type of
Strong's numbers per testament, type of morphology per testament, .... Any program can set a
value into the sidecar. This info would be read into BookMetadata and would be available for all
programs. If a program doesn't know what to do with it, it'd ignore it. It would be good to
communicate and document these new values. Automatic behavior that's added to JSword
would need to be discussed.
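One possible shape for such a sidecar, sketched with java.util.Properties. Everything specific here is an assumption for illustration: the key names, the values, the key=value format, and the class and method names. Nothing is wired into BookMetadata.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class SidecarSketch {
    /** Parses sidecar text in key=value form. */
    static Properties parse(String text) {
        Properties sidecar = new Properties();
        try {
            sidecar.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sidecar;
    }

    /** A program that does not know a key never asks for it; unknown values get a fallback. */
    static String get(Properties sidecar, String key, String fallback) {
        return sidecar.getProperty(key, fallback);
    }

    public static void main(String[] args) {
        // Hypothetical keys and values.
        Properties sidecar = parse("Font=ExampleFont\nMorphologyType.NT=robinson\n");
        System.out.println(get(sidecar, "MorphologyType.NT", "unknown"));
        System.out.println(get(sidecar, "MorphologyType.OT", "unknown"));
    }
}
```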
Generated at Tue Feb 09 07:49:28 MST 2016 using JIRA 6.2#6252sha1:aa343257d4ce030d9cb8c531be520be9fac1c996.