[JS-226] Robinson's morphology is not indexed in JSword modules
Created: 10/Jul/12  Updated: 08/Mar/13

Status: Reopened
Project: JSword
Component/s: o.c.jsword.index
Affects Version/s: 1.6
Fix Version/s: 1.7

Type: Bug
Priority: Critical
Reporter: Chris Burrell
Assignee: DM Smith
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Encoding.doc

Description

Lucene is not told to index the morphology information, rendering such searches impossible.

Comments

Comment by Chris Burrell [ 25/Oct/12 ]
See pull request =
Doesn't look right on github, but I'm pretty sure I selected the branch. Perhaps it's because it's branched off something that's ahead of the jsword...

Comment by Chris Burrell [ 06/Feb/13 ]
This has been pulled & merged: https://github.com/crosswire/jsword/pull/11 & https://github.com/crosswire/jsword/pull/10

Comment by Chris Burrell [ 06/Feb/13 ]
Now in the build. See OSISUtil.getMorphologiesWithStrong() and LuceneIndex.FIELD_MORPHOLOGY

Comment by DM Smith [ 09/Feb/13 ]
The entry now is something like G2345@robinson:T-NPN. There should not be robinson: in it.
While I like the functionality of G2345@T-NPN, we need to think about this. Do we want to have both strong: and morph: ? Will just one suffice? If we have just one, do we allow strong: and morph: searches? Or do we have a new name, e.g. lex: ?

Comment by Chris Burrell [ 09/Feb/13 ]
I think we do want to have both. One allows for exact matches on a Strong's number (strong: ), which is presumably faster than a trailing wildcard search. The morphology will most likely be using wildcards most of the time (e.g. all adjectives for this word).

Comment by Chris Burrell [ 21/Feb/13 ]
So the question is: do we not want to have "robinson:" in the indexed field? If codes clash (unlikely, I know), we wouldn't want to return the wrong results for some modules.
The clash is less unlikely as soon as you introduce wildcards, e.g. G1000@ * A * (ignore spaces - JIRA changed to bold otherwise) or whatever the code is going to be. A could mean adverb.
If we are going to change it, then what is your suggestion?
morphField.substring(morphField.indexOf(':') + 1)

Comment by DM Smith [ 21/Feb/13 ]
We don't want robinson: any more than we want strong: within the other field. Same with strongMorph: used in the OT.
BTW, the whole use of the morph field is not very useful. The least significant value of the field is in the most significant position. Wildcard searches won't be very helpful.
I think this (how to make searching useful) is a good topic for sword-devel.

Comment by Chris Burrell [ 21/Feb/13 ]
I disagree in terms of wildcards. You need to use wildcards to search for all adjectives, or all plurals. It's a perfectly good thing to do:
search for all verbs in the passage (regardless of their Strong's numbers)
or search for all the times a particular word is used as a verb.
I use the morph field quite successfully in STEP (although in the browser, rather than in a search) to highlight plurals in bold and singulars in normal font. And then I highlight masculines and feminines in blue / red. I use regular expressions, but a carefully crafted query using wildcards could equally do the job.

Comment by Chris Burrell [ 21/Feb/13 ]
For the first of the two points, you'd obviously be looking for terms rather than verses. The second is easy enough to do by looking for verses and using a custom highlighter or something like that.
But the key thing here is that STEP will have a feature where we allow the user to choose a word (by Strong's number, effectively) and then select which grammar points he wants to search across. Then we give him the resulting verses, highlighting the Strong's number. So a user may want mercy, used as a noun and in the singular mode.

Comment by DM Smith [ 21/Feb/13 ]
Really? How would you search for adverbs using wildcards?
Wouldn't that also pull back adjectives? How do you find plural verbs? How about first person voice (this one is tough as it is the unmentioned default)? How do you find all uses of a word as a verb in aorist form?

Comment by David Instone-Brewer [ 26/Feb/13 ]
Robinson codes are unnecessarily difficult to search, and it would be possible to design something easier for computer searches, but they are consistent in their complexity and can be searched by RegEx.
The key to them is to distinguish the following types of words:
ALL               X - Tense Voice Mood - Person Case Number Gender (- Extra)
Participles       X - Tense Voice Mood - Case Number Gender (- Extra)
Other Verbs       X - Tense Voice Mood - Person Number (- Extra)
1+2 Pronouns      X - Person Case Number (- Extra)
Refl+Poss Pn      X - Person Case Number Gender (- Extra)
Other Pronouns    X - Case Number Gender (- Extra)
Art, Nouns + Adj  X - Case Number Gender (- Extra)
(Uggg - the layout of this is lost - so line up the columns in your head)

In RegEx:
Verbs          [V][123][NADG][SP][MFN][\w-]*
Others         [V]-[123][NADG][SP][MFN][\w-]*
CONJ, PRT etc  [A-Z][A-Z][A-Z][\w-]*

For searches this can be summarised as:
1) Participles ie any that start with "V-" and include "P-"
2) Other verbs ie any that start with "V-" but no "P-"
3) Nouns, articles, adjectives and general pronouns ie starting "N-", "A-", "P-" followed by a letter, "D-"
4) Person Pronouns (1st+2nd+Reflex.+Poss) ie like nouns but add Person (ie a number) after the first hyphen (and sometimes have Gender at the end)
5) Unparsed words ie ADVerbs, PARTicles, PREPositions - ie anything that doesn't have a hyphen as the second character

Rules for searches:
Tense is the 1st letter after "V-"
Voice is the 2nd letter after "V-"
Mood is the 3rd letter after "V-"
Person is a number after "^V-"
Case is the 1st letter after "^V-"
Number is any S or P after "^V-"
Gender is the 3rd letter after "^V-"
(where "^V" is any letter except V)

RegEx examples:
Aorist tense = V-[2]A[\w-]                      all tenses = V-[2][PIFARLX][\w-]
Active voice = V-[2][A-Z]A[\w-]                 all voices = V-[2][A-Z][AMPEDONQX][\w-]
iNfinitive mood = V-[2][A-Z][A-Z]N[\w-]         all moods = V-[2][A-Z][A-Z][ISOMNPR][\w-]
Second person = [\w-][^V]-2[\w-]                all persons = [\w-][^V]-[123][\w-]
Nominative case = [\w-][^V]-\d*N[\w-]           all cases = [\w-][^V]-\d[NADG][\w-]*
Plural number = [\w-][^V]-\w*P[\w-]             all numbers = [\w-][^V]-\w[SP][\w-]*
Neuter gender = [\w-][^V]-\d[A-Z][A-Z]N[\w-]*   all genders = [\w-][^V]-\d[A-Z][A-Z][MFN][\w-]*

Comment by DM Smith [ 28/Feb/13 ]
Looking at the code now. There's something about it that isn't quite right.
In the OT we don't have morphology codes that have semantic meaning. In fact, the codes (strongsMorph aka "thayers") that are used are placeholders for the future. We don't have a dictionary for those codes.
I'd love it if David I-B, Daniel Owens and Chris Little would work together to come up with a meaningful code pattern that can be used. Chris has talked about a code that would not require an external dictionary. As long as we have a mapping from existing (robinson and "thayer") codes to the new one, I'll do the work in the KJV to update it.
But regarding the code, we should put the morph codes into the searchable morph field regardless of whether they are Robinson's or strongsMorph.
The form of SN@MC (Strong's Number@Morph Code) is useful if the Morph Code has semantic meaning (as above).
The other thing is that the lemma, morph and src attribute values form parallel arrays.
Here are 3 examples from the upcoming KJV of the longest ones:
<w src="2 3 4 5" lemma="strong:G3956 strong:G3739 strong:G5100 strong:G302 tr:παν tr:ο tr:τι tr:αν" morph="robinson:A-ASN robinson:R-ASN robinson:X-ASN robinson:PRT">whatsoever</w>
<w src="15 16 17 18" lemma="strong:G3588 strong:G4286 strong:G3588 strong:G740 tr:η tr:προθεσις tr:των tr:αρτων" morph="robinson:T-NSF robinson:N-NSF robinson:T-GPM robinson:N-GPM">the shewbread</w>
<w src="10 13 14 15" lemma="strong:G3756 strong:G1519 strong:G3588 strong:G165 tr:ουκ tr:εις tr:τον tr:αιωνα" morph="robinson:PRT-N robinson:PREP robinson:T-ASM robinson:NASM">never</w>
It is not a valid assumption that they always form parallel arrays. It took a lot of effort to clean up the KJV NT to make it so. Here is an example from the KJV OT:
<w morph="strongMorph:TH8799 strongMorph:TH8675 strongMorph:TH8686" lemma="strong:H05749 strong:H05749">What thing shall I take to witness</w>
There's no reason to think other modules are error free. See an example from ABP here: http://www.crosswire.org/tracker/browse/MOD-245
This has Strong's Numbers but they are pretty goofy.
So I think we should do the following:
Have one or more fields with SN, MC and SN@MC. The construction of SN@MC needs to be best effort, pairing the first Strong's Number with the first morph code, the second with the second and so forth until no more pairs can be formed.
The question I have is: do we really need more than one field if the codes are not overlapping?
And I wonder about the - in the code. I was under the impression that the - formed a word break. It seems to me that we need to normalize the code so that it can be searched well. Similar question regarding @: does it get stripped out by Lucene? Do we need to protect it?
I'm inclined to say that the easiest path is 3 fields: strong:SN, morph:MC and lex:SN@MC.
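The best-effort pairing proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the JSword implementation: the class LexPairing and its methods are hypothetical names, lemma entries are assumed to carry the "strong:" prefix (anything else, such as "tr:", is skipped), and the morph prefix is dropped with the substring(indexOf(':') + 1) idea from earlier in the thread. It does not normalize Strong's numbers and does not address how Lucene tokenizes - or @.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of JSword. */
final class LexPairing {
    private LexPairing() {
    }

    /** Splits a whitespace-separated attribute value into tokens; empty or null gives an empty list. */
    static List<String> tokens(String attribute) {
        List<String> result = new ArrayList<>();
        if (attribute == null || attribute.trim().isEmpty()) {
            return result;
        }
        for (String token : attribute.trim().split("\\s+")) {
            result.add(token);
        }
        return result;
    }

    /** Keeps only the "strong:" entries of a lemma attribute and removes that prefix. */
    static List<String> strongNumbers(String lemma) {
        List<String> result = new ArrayList<>();
        for (String token : tokens(lemma)) {
            if (token.startsWith("strong:")) {
                result.add(token.substring("strong:".length()));
            }
        }
        return result;
    }

    /** Removes whatever prefix precedes the first ':' (robinson:, strongMorph:, ...). */
    static List<String> morphCodes(String morph) {
        List<String> result = new ArrayList<>();
        for (String token : tokens(morph)) {
            // indexOf returns -1 when there is no ':', so substring(0) keeps the whole token.
            result.add(token.substring(token.indexOf(':') + 1));
        }
        return result;
    }

    /** Pairs the first SN with the first MC, the second with the second, until one list runs out. */
    static List<String> pair(String lemma, String morph) {
        List<String> strongs = strongNumbers(lemma);
        List<String> morphs = morphCodes(morph);
        int count = Math.min(strongs.size(), morphs.size());
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            pairs.add(strongs.get(i) + "@" + morphs.get(i));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The KJV OT example above: three morph codes but only two lemmas.
        System.out.println(pair(
                "strong:H05749 strong:H05749",
                "strongMorph:TH8799 strongMorph:TH8675 strongMorph:TH8686"));
    }
}
```

With the OT example the third morph code is simply left unpaired; it would still be available through a separate morph field if the three-field layout were adopted.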
Comment by David Instone-Brewer [ 04/Mar/13 ]
There are a few different issues here:

A) Greek OT:
1) The ABP problems with OT Greek codes (examples at MOD-245) are not easy to fix, because this is a copyright text.
2) The problems with OT Greek vocabulary stem from the fact that the OT has words which aren't in Strong's lexicon, because this was made for the NT. Ideally the lexicon should be extended in a way that is backwardly compatible - perhaps like the NASB does for Hebrew.
3) There is no OT Greek morphology in the ABP module (or any other OSIS module as far as I know).
STEP has permission to use the CCAT data for Greek OT lexicon and morphology (I'm not sure if we needed to ask, but we did). Implementing this is not straightforward, but it will happen.

B) Hebrew OT:
1) The Hebrew morphology codes are fairly straightforward, but not full. They only tell us the Stem and Mood for verbs, and supply no information about nouns, particles etc.
2) There are some additional codes marked as morphology which are actually notes (such as "the Qere is followed here instead of the Ketiv").
3) The KJV module, as coded at present, cannot be used to create arrays, because there are frequently more lemmas than morphology codes in the same tag, with no indication as to which morphology goes with which lemma. This is due to problems (1) and (2).
4) About 6000 words in the OT have no lemma tags.
These issues are now all fixed - the work was done by the Open Scripture guys and myself - and I'm in the process of checking it. It would be great to integrate this into the KJV module.

Hebrew Morphology:
The Open Scripture guys have a very well thought-out system of morphology codes and are creating a crowd-sourcing tool to get this implemented. However, this is probably a very long-term project.
I am in the process of adding a much more limited set of morphology codes in the meantime, ie Stem, Mood, person, number, gender - for all words.
The codes will not be perfect, but they will supply something while the Open Scripture morphology is being developed.

Question: Given that the Robinson codes are 'standard', do you think the OT Hebrew codes should try to emulate them, or should we start again with something which is easier to search and process?
I was thinking of something where every word would have the same morphology layout, eg:
VQi3SF meaning Verb: Qal Imperative 3rd singular feminine
or:
N---PM meaning Noun: plural masculine
The idea is that each morphology would have standard place markers for Stem, Mood, person, number, gender so (for example) plurals would always have "P" as the 5th character.

Comment by DM Smith [ 04/Mar/13 ]
I'm not sure I'm the best to answer the OT Hebrew Codes question. I took a 7 week crash course in Hebrew, but I don't remember enough of it to weigh in. I'd defer to Chris L, our resident linguist, or to you or some other Hebrew scholar.
But I'll give an end user kind of answer:
The pattern should serve two goals well: the casual user who needs to decode it at a glance and the advanced user who will form search requests.
For someone that has passing knowledge of Greek, the Robinson codes are fairly easy to learn and quick to read. As we allow the user to show/hide these values, this is important. I'd want the same for people with a working knowledge of Hebrew.
If I were to pick, I would not like to count to see that the P is in the 5th character and that for nouns, I have to see 3 dashes. A dash as a separator works fine visually, but not as a placeholder. Some fonts will show it as a single long dash. So I'd suggest a different order, such that there would be as few dashes as possible. No need to have trailing dashes. And perhaps a / or another character that visually will never join an adjacent character and will not need special processing when showing in HTML/XML.
The other part is that the end user will perhaps be forming search queries manually.
In Lucene, the don't care matcher for a single character is a dot (or maybe it is a question mark. I forget.) So the user would type 'N...PM' to search. Having a different order would give NPM* or perhaps just NPM, if we don't right pad the code.
But that's just a first response; I might find that I quickly adjust to 3 dashes in the middle.
One of the basic human user interface principles is to order the material from the most useful to the least useful, from left to right, from top down. I think such would apply to this as well (though not top down).

Comment by David Instone-Brewer [ 05/Mar/13 ]
Great idea about changing the order.
Nouns have number, gender
Particles usually have nothing
Verbs can have everything
Adjectives may have gender and number
Adverbs usually have nothing
Object markers have number and perhaps gender
Pronouns have number, gender & person
They all have a Type (ie 'Noun', 'verb' etc)
And they all have a Language (Hebrew, Aramaic or personal Name)
So to minimise place markers the order would be:
Language, Type, Number, Gender, Person, and then verbal details.
For verbs, we could simply record Stem (eg Qal, Hiphil etc) but there are 37 of these in Strong's system, and more in others, and it is more meaningful to interpret the Stems as Voice and Mood.
So, we end up with:
Language, Type, Number, Gender, Person, Voice, Mood, State/Aspect
eg
HNSM = Hebrew Noun Singular Masculine
Aa = Aramaic adverb
NLSF = Name of Location Singular Feminine
HVPM2PIC = Hebrew Verb Plural Masculine 2nd person Passive Intensive Consecutive imperfective
I've attached a table of all the abbreviations ("Encoding.doc")

Comment by DM Smith [ 06/Mar/13 ]
Regarding using regular expressions to do a search: Lucene search syntax is not regular expression. It is more like unix command-line globbing. I haven't seen regular expression support in a contrib to Lucene, but that doesn't mean it is not there.
But if not, to support regular expressions, we'll need to intercept the query, pick out the regular expression and use it to do our own search over our own store or the term dictionary.

Comment by David Instone-Brewer [ 07/Mar/13 ]
The RegEx expressions were more complicated than I had thought they would be.
Is it time to redesign the Robinson Codes?
They aren't particularly human-friendly or machine-friendly. I think the latter is more important because ideally people won't see the actual coding.

Comment by Chris Burrell [ 07/Mar/13 ]
Agreed - showing the codes to the user should be a last resort, as it implies that they need to learn the new system.

Comment by Chris Burrell [ 07/Mar/13 ]
It seems Lucene has some support for Regular Expressions anyway: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/contribregex/org/apache/lucene/search/regex/package-summary.html

Comment by DM Smith [ 08/Mar/13 ]
If we can create a mapping for Robinson codes to something that is better (human readable and easy to search), then we can use the mapping within JSword to provide a better user experience.
Basic thought: the user would see the new codes or a decoding of these codes into their language (or the default, if there's no such translation). They can search these codes either directly or via a wizard (what is done would be a front-end choice).
It may be that the underlying module uses the old codes. That'd be ok. Not ideal. The search would reverse the mapping, going from the new codes to the old codes, and use that to search the module. Likewise, when presenting the module, the old codes would be replaced with the new codes. This would be a process of normalization, which we do currently for Strong's numbers.
We may want to explore the idea of a module sidecar. On various occasions, I've wanted finer grain information regarding a module. Basically, we'd maintain a separate conf for the modules.
It'd contain information regarding things like: user provided font info, unlock keys, type of Strong's numbers per testament, type of morphology per testament, .... Any program can set a value into the sidecar. This info would be read into BookMetadata and would be available for all programs. If a program doesn't know what to do with it, it'd ignore it.
It would be good to communicate and document these new values. Automatic behavior that's added to JSword would need to be discussed.

Generated at Tue Feb 09 07:49:28 MST 2016 using JIRA 6.2#6252sha1:aa343257d4ce030d9cb8c531be520be9fac1c996.