Malformed interlinear layout related to \qt

jklassen · July 13, 2023, 4:45pm

I’m supporting an interlinear layout (Greek + Urdu glosses). In some cases (not all), poetic texts which have some text marked with \qt are malformed in the output from PTXPrint.

For example, in the following section from MRK 11:9:

In Paratext, the Greek text is marked as follows:

In the interlinearizer window, it appears as follows:

In the layout from PTXPrint:

There are other locations where the \q1, \q2 and \qt …\qt* combinations seem to be working OK.

I’ve tried to investigate what might be the cause – but being an interlinear, some of the process for how PTXPrint merges the texts and performs the layout is somewhat opaque/unclear. I could use some guidance about how to investigate this, or what might be going wrong.

Any help appreciated.

mjpenny · July 14, 2023, 12:39am

@jklassen Thanks for what you’re doing to support other PTXprint users as they tackle some interesting issues with formatting of RTL + LTR interlinear texts.

I don’t yet have any answers for you on this question, and both of the more technical developers are currently either in transition or on vacation, so it might be a few more days before you get a response from them. However, to help them dig deeper, I know it would be easiest if they have an Archive file to work from, so please go ahead and create that (from the Help tab) and then send the resulting .zip file to the ptxprint_support@sil.org address. I’ll make sure at least one of them gets to dig deeper for you when they are back online.

jklassen · July 14, 2023, 5:22pm

Thanks, Mark! I’ve sent the archive.

David_Gardner · July 20, 2023, 9:52am

Hi, When I try to run the archive, I’m getting this error report:

Which isn’t a message you can ignore. Thus, for instance, I see that at 5:23, this happens:

The problem is that without the information that Paratext saves when you give approval to the interlinears, there’s just not enough information for PTXprint to try to synchronise the two lines, so it gets very confused.
I expect that if you resolve that, then your layout problems will go away.

@mjpenny Can I suggest that the error message be expanded to say something like ‘Failure to approve interlinear texts will result in layout errors’?

jklassen · July 20, 2023, 1:19pm

Hello David,

Thank you very much for looking at this issue.

Yes, I confess that I knew about these errors. (I’m halfway around the world from the team working on this, and it’s not always quick to get a fix to the text.) I did try to evaluate what was happening in the layout before inquiring. It seemed to me that the result of the unapproved references was formatting as your screenshot shows: no interlinear text in the reported verse. In the case of the texts marked with \qt, the rendering problem is rather different from what happens in unapproved verses, and it happens in the same manner in multiple \qt locations, even when the rest of the text around these locations is rendered OK.

In any case, I have received a text from the team which does not have any unapproved verses (only 1 intentionally missing verse in the Greek and Urdu text at 11:26, which PTXPrint still reports as unapproved). The same rendering trouble happens at 11:9, 12:10, 12:36, 14:27 and 14:62, which are all \qt locations.

I’ll send an updated archive, for whenever there is space to look again.

Jeff

David_Gardner · July 21, 2023, 12:22pm

I forgot to mention: I did look for \qt markers elsewhere in the text, and found quite a few places where the interlinear was merged correctly.

jklassen · July 21, 2023, 1:02pm

Ok. Well, I did not mean to claim that every \qt fails, only that wherever the layout fails, there is a \qt.

I’m not sure how to get to the cause of the failed interlinear layout, David. If it’s a fault in the text or the configuration, I have not been able to identify it, yet. I’ll keep looking myself.

Martin_Hosken · July 26, 2023, 10:07am

I’m not sure what is going wrong here, but the first line of verse 9 ends with a space. Removing the space fixes everything up nicely. And looking at the interlinear file, I suspect that Paratext is magically munging that space away. I’m sorry but trying to guess the internal magic that PT does when it considers a character position in a verse is a pain. I do hope we can sort this out for the future.
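For anyone wanting to check a project for the same condition, here is a minimal sketch that lists lines ending in whitespace. It assumes the book text has been read into a plain string; the function name and the sample text are made up for illustration and are not part of PTXprint or Paratext.

```python
def find_trailing_spaces(usfm_text):
    """Return (line_number, line) pairs for lines that end in a space or tab."""
    hits = []
    for number, line in enumerate(usfm_text.splitlines(), start=1):
        # A line differs from its right-stripped form only if it ends in whitespace.
        if line != line.rstrip(" \t"):
            hits.append((number, line))
    return hits


# Hypothetical sample: the first poetic line ends with a stray space.
sample = "\\q1 \\v 9 first poetic line \n\\q2 \\qt second line\\qt*\n"

for number, line in find_trailing_spaces(sample):
    print(f"line {number}: {line!r}")
```

This only flags candidate lines; whether a given trailing space actually upsets the interlinear merge would still need checking in the output.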

jklassen · July 27, 2023, 3:32pm

Martin – thank you. Excellent. I’ve edited the project text to remove spaces at these locations, and the rendering is now OK, as you indicated.

I doubt I would have ever identified this cause. Who knows …

As I think you’re suggesting - perhaps PTXPrint could be more tolerant of this, somehow, in the future.

David_Gardner · July 27, 2023, 8:56pm

The problem is, I believe, what Paratext’s XML specifies and what it doesn’t. It specifies the chunk of text, a gloss-id (cross-referencing the Lexicon.xml file) and the position of the original text in some representation of the input stream. It does not preserve the order of occurrence. The verses are not in sequence, either.

A given word or phrase may be glossed as word(s) or as stem and morphemes (of course with different glosses), and the only way to tell which homograph is which, or which level of glossing the user expects, is via the positioning data. The glossing file seems to ignore case, punctuation and SFM marks, and probably other things too.

Thus, if the glossing file matches the input, and assuming that Martin has understood the undocumented way in which Paratext counts characters in its internal representation of the file, the right thing to do is to cut out the word/phrase unit by position and replace it with the gloss, possibly doing some case transposition as appropriate.

If the verse has changed, then that doesn’t work. If the internal representation of the file is wrong, then it doesn’t work.
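As a rough illustration of why position matters, here is a minimal sketch of pairing glosses with text by stored offset. The `(start, length, gloss)` tuples, the sample sentence and the gloss labels are all hypothetical; this is not Paratext's actual file format or PTXprint's code, only the general idea.

```python
def pair_by_position(verse_text, spans):
    """Pair each gloss with the slice of verse_text at its stored offset.

    spans is a list of (start, length, gloss) tuples (an assumed layout).
    """
    return [(verse_text[start:start + length], gloss)
            for start, length, gloss in spans]


# Offsets recorded against the text as the glossing tool is assumed to see it.
spans = [(0, 2, "gloss-in"), (3, 3, "gloss-the"),
         (7, 9, "gloss-beginning"), (17, 3, "gloss-was")]

as_glossed = "in the beginning was"
with_extra_space = "in  the beginning was"  # one space the offsets do not count

print(pair_by_position(as_glossed, spans))
print(pair_by_position(with_extra_space, spans))
```

In the second call every slice after the extra space is shifted by one character, so the glosses no longer line up with whole words. That is the kind of mismatch a single uncounted space could cause.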

Any unconstrained search and replace is going to get false matches on homographs, so that can’t be used as a general approach. There could be some guessing done, looking for fuzzy matches plus or minus a few characters, breaking at word gaps, for instance. That would cost a fairly large investment of programmer time, and there could still be problems. If the spacing has changed because someone has reordered the words in the verse, the homograph offset by 3 characters might not be the same word. E.g. Romani has fairly free word order:

 O    Del  te  del tut haro
(the) God SUBJ give you grace
"May God give you grace"

vs

 Te  del   o   Del  tut haro
SUBJ give (the) God you grace
"May God give you grace"

PTXprint can’t go into the linguistic rules of the language. Not even Paratext remembers or asks what parts of speech things are. The only reliable way to get trustworthy interlinear data is for the user to approve glosses.
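The homograph problem in the Romani example can be sketched with a plain case-insensitive word search. The function name is made up for illustration, and ignoring case is an assumption based on the glossing file apparently doing so.

```python
import re


def find_word(verse_text, word):
    """Return the start offset of every whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(verse_text)]


order_a = "O Del te del tut haro"
order_b = "Te del o Del tut haro"

for verse in (order_a, order_b):
    offsets = find_word(verse, "del")
    # Show which surface form sits at each matching offset.
    print(verse, [(o, verse[o:o + 3]) for o in offsets])
```

Both orderings give two hits for "del", and the word at offset 9 is "del" (give) in the first but "Del" (God) in the second. A search that trusts a nearby offset after the verse has been reordered could therefore attach the wrong gloss.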