Phrase-ending strings instead of characters

A colleague preparing for an upcoming app-builder workshop works in a language whose orthography relies on words to indicate grammatical divisions rather than the standard punctuation symbols that aeneas expects. She asked if it is possible to define phrase breaks in SAB based on short conjunctions (e.g. “te”, “date”, “chi”, etc.). Are there characters (or \ codes) that can be included in the list of phrase-ending characters on the Audio Synchronization tab to designate a series of characters as a string, instead of each being treated as an individual character in the list? Or is there another way to use a string to indicate phrase breaks?

Some languages in Asia make use of zero-width spaces (U+200B). Those could be added before each of the words you mentioned, either via the project Changes or via the Aeneas Changes. Then you would add that space to the list of phrase-ending characters.
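
To illustrate the idea outside SAB, here is a minimal Python sketch of the same approach: put a zero-width space in front of each conjunction, then treat that character as the phrase break. The function names and the sample sentence are invented for the illustration, and Python's `re` module is only a stand-in; SAB's own change rules may behave differently.

```python
import re

ZWSP = "\u200b"  # zero-width space, U+200B
CONJUNCTIONS = ["te", "date", "chi"]  # the example words from the question


def mark_phrase_breaks(text, words=CONJUNCTIONS):
    """Insert a zero-width space before each whole-word conjunction."""
    # Longest words first so the alternation never prefers a shorter word.
    ordered = sorted(words, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return re.sub(pattern, ZWSP + r"\1", text)


def split_phrases(marked_text):
    """Split on the zero-width space, as a phrase-ending character would."""
    return [p.strip() for p in marked_text.split(ZWSP) if p.strip()]


# Invented placeholder sentence, not real language data.
marked = mark_phrase_breaks("nana kuni te sava date kivi chi nuu")
phrases = split_phrases(marked)
print(phrases)
```

Because the rule uses word boundaries, the “te” inside “date” is not marked separately.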

The only other way is to just do verse level syncing.

Ok, thanks. It’s good to hear that people have done this in other languages, and that approach makes sense.

I’ve used the Project Changes area before so I tried that first. I ran into the problem that when I enable change rules for some words, the Aeneas synchronization tool won’t launch. It seems to be related to how I’m defining the end of the string to be replaced, because it works to replace \btá\b with \u200B$1, but it doesn’t work to replace \bta\b with \u200B$1. When the 2nd change rule is enabled (the one with the non-accented vowel), the Aeneas tool won’t run. I get the same results with other tone pairs: a word with an accented final vowel works, but enabling the same rule with a non-accented vowel at the end doesn’t. I’m guessing that I’m doing something wrong in the regex code.

I was going to try the second option that you mentioned, but I don’t know for sure where to make the “Aeneas Changes”. Are you referring to the “Character replacements” page of the wizard where we can tweak the characters that eSpeak sees? I had assumed that the changes there wouldn’t be passed along to the text in the app that the highlighting algorithm sees.

Are your characters composed or decomposed? I assume composed.
Have you tried writing the word as Unicode escapes, \u0074\u0061?

Yes, I meant the Character replacements page.

I have not done testing on that so I am not sure.

Thanks for reminding me about using decomposed characters. The keyboard setup I currently use produces composed characters, and that’s what I typed the change rules with, but our Paratext project is configured to normalize with decomposed characters, so the change rules won’t match with anything. Unfortunately when I pasted the decomposed version with the accented vowel from Paratext into the “Find” box, it broke the one rule that was working. I was finally able to get it to work, though, by using the Unicode string for the two letters and then the combining acute accent. So the rule for adding the zero-width space to the string “tá” that seems to work so far is to Find \b\u0074\u0061\u02ca\b and Replace it with \u200B$1. I’m still not able to add the zero-width space to the non-accented version, though: enabling the rule to Find \b\u0074\u0061\b and replace it with \u200B$1 stops the Aeneas tool from running.
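
For anyone checking their own text, the composed/decomposed mismatch can be seen with a short Python sketch. It uses the standard `unicodedata` module; the `escapes` helper and the sample string are made up for this illustration. Note that in this sketch the combining acute accent comes out as U+0301, so it is worth checking which code points the project text actually contains before typing them into a rule.

```python
import re
import unicodedata


def escapes(s):
    """Show a string as \\uXXXX escapes, one per code point."""
    return "".join(f"\\u{ord(c):04x}" for c in s)


composed = unicodedata.normalize("NFC", "ta\u0301")  # t + precomposed á
decomposed = unicodedata.normalize("NFD", composed)  # t + a + combining acute

print(escapes(composed))    # \u0074\u00e1
print(escapes(decomposed))  # \u0074\u0061\u0301

# Invented sample text, stored decomposed as in the Paratext project.
project_text = unicodedata.normalize("NFD", "kuni tá sava")

# A rule typed with a composed vowel finds nothing in decomposed text...
composed_hit = re.search(re.escape(composed), project_text)
# ...while the same rule typed in decomposed form does match.
decomposed_hit = re.search(re.escape(decomposed), project_text)
print(composed_hit is None, decomposed_hit is not None)
```

The two forms look identical on screen, which is why writing the rule as explicit escapes makes it easier to see what is really being matched.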

Just a note to document 2 things that I found I was doing wrong:

  • I was using the wrong notation to backreference the found text. I was using $1, but the correct notation is \1.
  • I was not enclosing the regular expression in parentheses to make it a capturing group.

The notation that works for me is to use (\bta\b) in the search field and \u200b\1 in the replace field.
We worked this out during an apps workshop recently in Mexico, so I wrote it up more completely for those involved in the Spanish RAB forum here.
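
The two mistakes noted above can be reproduced in Python, whose `re` module also uses \1 for backreferences. This is only a sketch with an invented sample string; it shows how Python behaves, and SAB's regex engine may report these problems differently (for example by failing to launch the tool rather than raising an error).

```python
import re

ZWSP = "\u200b"  # zero-width space, U+200B
text = "kuni ta sava"  # invented sample text

# Working form: parenthesised search, \1 in the replacement.
fixed = re.sub(r"(\bta\b)", ZWSP + r"\1", text)

# $1 is not a backreference in Python's re; it is inserted literally.
dollar = re.sub(r"(\bta\b)", ZWSP + "$1", text)

# \1 without a capturing group in the search is rejected outright.
try:
    re.sub(r"\bta\b", ZWSP + r"\1", text)
    ungrouped_error = None
except re.error as exc:
    ungrouped_error = str(exc)

print(repr(fixed))
print(repr(dollar))
print(ungrouped_error)
```

Only the first call keeps the word and adds the zero-width space in front of it.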

Thanks for the feedback. Always good to hear of resolutions.