RTL verse references printed in wrong order (in Modules)

Jim_Smith · October 3, 2023, 6:19am

When printing a module for an Arabic script based language, I noticed that the verse references are printed incorrectly. Here is the entry from my module SFM file:

\s God creates the world
\r ($(GEN 1:1-27))
\ref GEN 1:1-27

Here is the output I got from PTXprint 2.3.45:

Screenshot 2023-10-03 090955

(Literally: “1:1-27 Genesis” — the chapter/verse part is printed just like in a LTR language.)

But from right to left, it should be: book name, space, chapter number, colon, initial verse, hyphen, final verse

Now that I look for it, I see the same error in parallel passage headings.

David_Gardner · October 3, 2023, 7:52am

There are quite a few possibilities of where things are going wrong here:

1. Is the document actually set RTL, or is the \s actually what the module says? We do have the possibility to do what I call a “series diglot” (switching all the language settings for relevant parts of the publication) but it’s not automatic.
2. The ($(GEN...)) reference might be generating incorrect intermediate output. To Test: Could you have a look at the final USFM tab and see if it looks right or wrong there?
3. There might be some Unicode direction-switching bytes getting supplied which are confusing something. To Test: Could you type (not copy and paste) a number range into an extra \r and see if that works?
4. I suppose it’s possible that for some reason \r might be forgetting that the document is RTL. To Test: could you type a range into some other line where things are formatting correctly?

Jim_Smith · October 3, 2023, 8:40am

Thanks for getting to this so quickly.

1. It’s a completely RTL document. I can’t recall whether I had to tell PTXprint that, or if it got that from Paratext.
2. Could you tell me how to do this? I don’t see that tab in PTXprint.
3. and 4. When I type the verse reference into either \s or \r it comes out wrong in the same way. But it otherwise seems to be in RTL mode. The other words are coming out in the correct RTL order.

\s ببب پیدایش ۱:۱-۲۷ ییی
\r ($(GEN 1:1-27))
\r پیدایش ۱:۱-۲۷
\r قیرست سکند
\ref GEN 1:1-27

Screenshot 2023-10-03 113302

mjpenny · October 3, 2023, 10:51am

Jim_Smith · October 3, 2023, 11:05am

Ah, there it is.

Yes, it is set to Right-to-Left, and with RTL book binding.

David_Gardner · October 3, 2023, 4:52pm

So urm… at least it’s consistent!
Very odd!

jeff_heath · October 4, 2023, 2:13am

RTL text is weird. Numbers are actually displayed in LTR format. So if you got rid of the colon and the hyphen, the correct display of that string of numbers would be:
i.e. “1127”, reading left-to-right

If I paste the text from your \s above into Word, I get the following (exactly what you are seeing in PTXprint):

However, if I paste it into LibreOffice Writer, I get the following:

(what you were hoping to get)

In this case, I think Word is actually more correct. The numbers are LTR, and the colon and the hyphen are “Neutral” in their bidirectionality (see Bidirectional text - Wikipedia). That means that they don’t force a direction on the text, but take it from the surrounding context. And since they are in the context of LTR characters (the numbers), they should keep the LTR direction, and that whole string of numbers and punctuation should be presented on the line in an LTR direction, as PTXprint is doing.

So from that perspective what PTXprint is producing is theoretically correct (from the bidirectional text algorithm). But that’s actually not what you want. We are reading our text RTL, and we want the different chunks (separated by punctuation) to appear one after the other from RTL, like shown in the LO Writer output above. (I don’t actually know why the LO Writer output is like that. It doesn’t seem like it is following the Unicode bidi algorithm…)

But I can get that behavior in Word by adding special marks called Right-to-left marks. They are Unicode codepoint U+200F (see Right-to-left mark - Wikipedia). You can insert one of these marks after a punctuation mark to force an RTL text direction for that punctuation. To do that in Word, put the insertion point after the colon (I would recommend using the arrow keys to find that spot), type “200F” and then press Alt+X (hold down the Alt key and tap the X key). Your chapter one should now jump to the right of your string of numbers and punctuation. Do the same after the hyphen, and your string will look the same as the LO Writer output above.

But now the tricky part is getting those RTL marks inserted in the text for PTXprint. I haven’t tested this, but I think you might be able to use Changes.txt (on the Advanced tab) to create rules that will insert the RTL marks in the proper places. Try these rules (untested):
'(\d):(\d)' > '\1:\u200f\2'
'(\d)-(\d)' > '\1-\u200f\2'

But there is one more interesting issue. I notice that you are using the Eastern Arabic-Indic digits, which start at U+06F0. (See https://www.unicode.org/charts/PDF/U0600.pdf.) I’ve mostly used the plain Arabic-Indic digits, which start at U+0660. I think the \d digit designation should work for those digits as well, but if it doesn’t, you might need to use something like this: [\u06F0-\u06F9] in place of \d (the range starts at U+06F0 so that zero is included).
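For anyone who wants to try the two rules outside PTXprint first, here is a minimal sketch in Python. It assumes Python's standard re module as a stand-in for whatever regex engine Changes.txt uses, which may behave differently, and add_rlm is a hypothetical helper name. The digits are written as escapes so that the editor's own bidi display cannot reorder the source.

```python
import re

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

def add_rlm(text):
    """Insert an RLM after a colon or hyphen that sits between two digits.

    Mirrors the two untested Changes.txt rules above; hypothetical helper.
    """
    text = re.sub(r"(\d):(\d)", r"\1:" + RLM + r"\2", text)
    text = re.sub(r"(\d)-(\d)", r"\1-" + RLM + r"\2", text)
    return text

# "1:1-27" in Eastern Arabic-Indic digits (U+06F0..U+06F9), in logical order.
reference = "\u06f1:\u06f1-\u06f2\u06f7"
fixed = add_rlm(reference)

# Show the code points so the inserted marks are visible.
print(" ".join(f"U+{ord(c):04X}" for c in fixed))

# In Python, \d in a str pattern matches both digit ranges discussed here.
print(bool(re.fullmatch(r"\d", "\u06f0")), bool(re.fullmatch(r"\d", "\u0660")))
```

In Python at least, \d covers both the U+0660 and the U+06F0 ranges, so the explicit character class is only needed if the engine behind Changes.txt turns out to be stricter.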

Anyway, that gives you something to try. And we’ll all be interested to know if you make progress!

Jeff

jeff_heath · October 4, 2023, 2:18am

And I should point out that this sort of manipulation is not something that the average PTXprint user should have to do. If this is indeed a global problem for RTL texts, then we should figure out a way for PTXprint to fix this “under the hood”, so the user doesn’t need to go to extreme measures to get the desired output.

Jim_Smith · October 4, 2023, 6:21am

Well, just so the goal is clear, here’s a picture of a published Farsi translation with a parallel passage indicator. The first one is a reference to Matthew 6:25-33, and the visual order is: 33-25:6 Matthew

You’re absolutely right that the digits are written LTR, but when you have delimiters and such, it’s like a sequence of digits is a single word. Then all the words get laid out in RTL order.

Microsoft Word is puzzling. If I type in the verse reference, it comes out correctly (in fact, whether I’ve set the paragraph to RTL or LTR). If I copy and paste from this browser window, it comes out incorrectly (again, whether the paragraph is RTL or LTR). WYSIAYG, I guess.

Here are some things that I can do to replicate the wrong result (in Word or other GUIs):

Place a LEFT-TO-RIGHT MARK (U+200E) at the beginning of the reference.
If I remove the book name, the chapter/verse part gets laid out wrongly.

These suggest to me that the chapter/verse part is being laid out in LTR mode.

Here are some further tests. Verse references come out wrong when I put them in verse text. I can’t quite explain that, unless XeLaTeX isn’t being told at all that it’s an RTL context.

\id PHM
\h سسسس
\c 1
\cl
\s1 سسسس
\p \v 1 پیدایش ۱:۱-۲۸
\s1 سسسس
\p \v 2  ۱:۱-۲۸

Over to XeLaTeX…

If I use fontspec and bidi it comes out wrong whether the context is LTR or RTL:

\documentclass{book}
\usepackage{fontspec,bidi}
\setmainfont[Script=Arabic]{Times New Roman}
\begin{document}
\setRTL
۱:۱-۲۷

پیدایش ۱:۱-۲۷

\setLTR
۱:۱-۲۷

پیدایش ۱:۱-۲۷
\end{document}

Conversely, if I use xepersian, I can’t make it come out wrong. The chapter/verse portion gets laid out correctly in all four of these:

\documentclass{book}
\usepackage{xepersian}
\usepackage[fontsize=16pt]{fontsize}
\settextfont{Times New Roman}
\begin{document}
۱:۱-۲۷

پیدایش ۱:۱-۲۷

\beginL
۱:۱-۲۷
\endL

\beginL
پیدایش ۱:۱-۲۷
\endL
\end{document}

I’ll stop here. It looks to me as if xepersian has fixed a shortcoming in fontspec and bidi, but I don’t know what that is.

jeff_heath · October 4, 2023, 11:44am

The way I read it is that the output that you are seeing from PTXprint follows the Unicode bidi algorithm, although it doesn’t give you what you want. The bidi algorithm says that colon and hyphen are neutral characters, and will go with the flow of the surrounding characters. When in a string of LTR characters (like numbers), they will continue in LTR, giving you the result you see. The xepersian package apparently changes the bidi characteristics of those punctuation marks, giving you what you want. Interestingly, in Word, if you put a space after the colon and the hyphen, it will turn the reference around to the way you want it to look (but with extra space). This is actually quite strange because the spaces are also listed as Neutral in the bidi algorithm (see link above). I believe a number of RTL script projects have used this technique in Paratext to get the references to turn the “right” way around. But the addition of the RTL mark does the same thing without adding the space.

Can you maybe try putting those rules in the Changes.txt file, to see what that does?

Note that in your module you are entering your references in normal (Arabic!) numerals. Do you have any references in your Paratext project that are already entered with the Arabic-style (Hindi!) numerals? If so, how do they show up in Paratext? I believe Paratext automatically inserts an RTL mark U+200F in those sorts of references in an RTL project, to try to make them appear correctly. I assume that PTXprint could do a similar thing. (Note, however, that Paratext appears to insert the RTL mark before the colon, which I think is also a valid option.)

Jim_Smith · October 4, 2023, 12:28pm

Yes, if I put those commands in the changes file, the verses do come out correctly. Thank you for that fix.

If you just look at the chapter/verse part (۱:۱-۲۷) it’s operating according to the bidi algorithm. But by that algorithm you would expect the presence of a book name (made up of RTL characters) to switch it into RTL mode (پیدایش ۱:۱-۲۷). (In this text editor and in the preview window as I type this, that’s exactly what does happen.) So I’m imagining that the chapter/verse part is off in its own \hbox or something, and doesn’t realize it’s in an RTL document for some reason.

(I don’t want to take this too far afield, but there might be other problems lurking about if the document isn’t globally set to RTL. I notice that footnote markers appear on the wrong side of words, for instance: on the right side of the word instead of the left side. I don’t know for certain, but that seems like an LTR/RTL issue.)

I haven’t tried with Arabic-Indic digits. I haven’t ever run into a situation, though, where something worked with Arabic-Indic but not Eastern Arabic-Indic digits.

jeff_heath · October 4, 2023, 12:41pm

Well, it’s good that you have a work-around for the moment. I’ll try to discuss this with @mjpenny to see if something should be done in PTXprint.

If you check out the bidi algorithm again (Bidirectional text - Wikipedia), note that the digits are found in the “Weak” section. I had assumed that they were “Strong” characters, and would define the direction of the intervening “Neutral” characters. But I understand what you are saying now, that putting in the book name (in “Strong” RTL characters) overrules the “Weak” digits to define the direction of the “Neutral” characters.
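To check which bidi class each character in these references actually carries, a small sketch using Python's standard unicodedata module can help. The sample set and the helper name bidi_classes are only illustrative, and the script reports the raw class of each character without running the bidi algorithm itself.

```python
import unicodedata

# Characters that appear in the references discussed in this thread.
samples = {
    "ARABIC LETTER PEH": "\u067e",
    "EXTENDED ARABIC-INDIC DIGIT ONE": "\u06f1",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "DIGIT ONE": "1",
    "COLON": ":",
    "HYPHEN-MINUS": "-",
    "SPACE": " ",
    "RIGHT-TO-LEFT MARK": "\u200f",
}

def bidi_classes(chars):
    """Map each label to the character's Unicode bidirectional class."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

for name, cls in bidi_classes(samples).items():
    print(f"{cls:>3}  {name}")
```

In UAX #9 terms, AL and R are strong types; EN, AN, CS and ES are weak; WS is neutral. So the colon (CS) and the hyphen-minus (ES) are classed as weak separators, and the two digit ranges differ: U+06F0..U+06F9 are EN (European Number) while U+0660..U+0669 are AN (Arabic Number). That difference may be worth keeping in mind when comparing results between the two digit sets.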

jklassen · October 4, 2023, 1:33pm

Hello,

I’m not sure if I’m adding any information to this thread which is not already known. I’ve been a participant in typesetting numerous RTL projects. I want to simply add what I know about how Paratext interacts with RTL texts and references.

When editing a RTL project in Paratext, Paratext identifies strings which follow the format of a scripture reference, or a digit range – i.e. patterns like #:#, #.#, #:#-#, #:#,# etc. As these are identified in the open chapter being edited, Paratext inserts a U+200F before the punctuation characters, causing them to appear on the left side of the preceding number (overriding the otherwise LTR direction initiated by the number(s)).

So, instead of this:

you see this:

If you enter references in Paratext, in logical order, you will see this visual update occur as you complete entering them.

So, in projects which were edited in Paratext, these 200F characters may already be there. You might want to account for that in any changes.txt expressions you use. I believe it is true that Paratext only performs this insertion of 200F for chapters which have been opened in the editor (i.e. it does not just go through the whole project and do this).
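One way to find out how many references in a file already carry these marks, before writing any changes.txt rules, is a quick audit. This is a rough sketch under my own assumptions: the separator set [:.,-] is taken from the patterns listed above, audit is a hypothetical helper, and Python's re module stands in for any other tool.

```python
import re

# A separator between two digits, with an optional RLM (U+200F) on either side.
REF_SEP = re.compile(r"(?<=\d)(\u200f?)([:.,\-])(\u200f?)(?=\d)")

def audit(text):
    """Return (marked, unmarked) counts of digit separators.

    'marked' means an RLM sits directly before or after the separator.
    """
    marked = unmarked = 0
    for m in REF_SEP.finditer(text):
        if m.group(1) or m.group(3):
            marked += 1
        else:
            unmarked += 1
    return marked, unmarked

# "1:1-27" in Eastern Arabic-Indic digits, written as escapes (logical order).
plain = "\u06f1:\u06f1-\u06f2\u06f7"
# Same reference with an RLM before the colon only, as Paratext might leave it.
partly_marked = "\u06f1\u200f:\u06f1-\u06f2\u06f7"

print(audit(plain))
print(audit(partly_marked))
```

A mixed result (some marked, some unmarked) would suggest that only some chapters were opened in the Paratext editor, so any rule that adds marks should skip the separators that already have one.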

A reason this was done in Paratext is so that any downstream publisher or publication tool could just render the text directly – especially important for some digital app paths where the publishing path does not necessarily allow interventions like changes.txt

Sharing what I understand in case it helps at all here.

Jeff