Is there a way to set a space as the sentence-end punctuation in HearThis? Currently the chapters are only being broken into a few blocks because most sentences, in the language in question, end with a space. How can a space (U+0020) be added to the “Additional characters to break text into blocks” setting?
No, there is currently no way to do that. However, I can see how this would be needed for languages such as the one you are working on. I will add this to my “to-do” list.
I just did a fairly comprehensive review of available Unicode characters, and there does not appear to be any character that displays as a space and functions “normally” as a sentence-ending punctuation character. I certainly don’t consider myself an expert in scripts, but in a way I find this omission a bit puzzling given the prominence of scriptio continua. Still, I imagine that even if such a codepoint existed, it would be relatively difficult to ensure that it was used consistently.
As an aside, in the language you are working with, is any character (e.g., a zero-width space) used to represent word breaks? For line-breaking purposes, is any position valid, or is there some way of knowing where it is okay to break a line?
For such languages, words are typically separated by a ZWSP (U+200B). But since these characters are invisible, they are often applied inconsistently. In high-quality texts like the ones you are dealing with, they are often converted from a visible marker, say a /, in the source text.
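To illustrate the conversion mentioned above, here is a minimal Python sketch. It assumes the source text marks word breaks with a visible / that never occurs as literal text; the function names are hypothetical and not part of any tool discussed here.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE (U+200B)

def slashes_to_zwsp(text):
    """Replace visible '/' word-break markers with invisible ZWSP.

    Assumption: '/' is used only as a word-break marker in this text.
    """
    return text.replace("/", ZWSP)

def split_words(text):
    """Split text into words at ZWSP, dropping empty pieces."""
    return [word for word in text.split(ZWSP) if word]
```

For example, `split_words(slashes_to_zwsp("ab/cd"))` yields the two words `ab` and `cd`, while the converted text itself shows no visible separator.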
There is no sentence-final space, both for the reasons you give and because spaces have different functions in different languages. In Burmese, for example, the space is a grammatical marker that identifies noun phrases. In Thai, it is used like a comma or period, for instance between items in a list.
I don’t know if HearThis contrasts comma and period, but suffice it to say, as usual, it’s more complicated than we would want.
HearThis uses Libpalaso’s CharacterUtils.IsSentenceFinalPunctuation and thus knows about all the codepoints that are officially deemed to be sentence-ending (including the period, exclamation mark, and question mark, plus around 70 others used in various scripts). It also allows the user to manually add to the list any other characters that should be used to break the text into recordable blocks. This makes it possible to add characters such as the colon, semicolon, etc. when appropriate. It is possible to add the comma to this list, but that is generally discouraged. At least in most Roman-script languages, breaking the text at commas would result in blocks that cannot be recorded naturally.
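As a rough illustration of the idea (not HearThis’s or Libpalaso’s actual code, which is written in C#), here is a minimal Python sketch of breaking text into blocks at sentence-final characters plus user-supplied extras. The set of sentence-final characters below is a tiny hand-picked subset chosen for the example, and the function name and parameter are hypothetical.

```python
# Assumption: a small illustrative subset, NOT the full list libpalaso uses.
SENTENCE_FINAL = {".", "!", "?", "\u0964", "\u104b", "\u3002"}

def split_into_blocks(text, additional_breaks=""):
    """Split text into blocks ending at sentence-final characters.

    additional_breaks: extra characters (e.g. ";" or ":") that should
    also end a block, mimicking a user-configurable list.
    """
    breakers = SENTENCE_FINAL | set(additional_breaks)
    blocks, current = [], ""
    for ch in text:
        # A block ends once we move past a run of break characters,
        # so "?!" stays attached to the block it terminates.
        if current and current[-1] in breakers and ch not in breakers:
            if current.strip():
                blocks.append(current.strip())
            current = ""
        current += ch
    if current.strip():
        blocks.append(current.strip())
    return blocks
```

With the defaults, `split_into_blocks("One. Two; three? Four")` gives three blocks; passing `additional_breaks=";"` splits the middle block in two. The sketch ignores details such as closing quotation marks after the punctuation.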
Where would I find the details on CharacterUtils.IsSentenceFinalPunctuation? I found a GitHub libpalaso project but didn’t see that file. Are there any written details on what is included?