More non-breaking hyphen problems, from .docx files

This is a different topic from my recent question, so I’m starting a new thread. We are having this problem with Reading App Builder, but I assume that since SAB also can load .docx files, the problem would be there as well.

I noticed that if our source document has non-breaking hyphens in it (inserted from the menu, or with Ctrl+Shift+_), they disappear when that text is put into an app. With a little more digging, I found that the document.xml file embedded in the .docx file (which you can open if you rename the .docx to .zip) marks non-breaking hyphens with a special code: <w:noBreakHyphen/>. But that code isn’t handled very well, even within Word itself. If I save a document with non-breaking hyphens as a UTF-8 text file, they show up as character U+001E! If I copy and paste into a text editor, they often turn into spaces. So it turns out that a Word non-breaking hyphen isn’t really a hyphen character at all, but just some sort of code for Word's own internal purposes.
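For anyone who wants to check their own documents, here is a rough Python sketch that counts the <w:noBreakHyphen/> elements in a .docx without renaming it by hand. The function name count_no_break_hyphens is just something made up for this illustration, and it assumes the main text lives in word/document.xml, which is where Word normally puts it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace that the "w:" prefix refers to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_no_break_hyphens(docx_file):
    """Count <w:noBreakHyphen/> elements in a .docx (hypothetical helper).

    docx_file may be a path or a file-like object; a .docx is a zip archive.
    Assumes the main document part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # ElementTree spells namespaced tags as "{namespace}localname".
    return sum(1 for _ in root.iter(f"{{{W_NS}}}noBreakHyphen"))
```

A non-zero count means the document contains Word's special code rather than real U+2011 characters.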

BUT I found that if you insert a real non-breaking hyphen, it works fine. The easiest way to do that in Word is to type 2011 and then press Alt+X. That gives a real Unicode non-breaking hyphen (U+2011), and in the document.xml file it shows up as: <w:t>‑</w:t>. Then it loads OK into the app builders, can be saved to text files, copied into text editors, etc.

So I don’t know exactly what should be done about this problem. Maybe in your .docx import you could replace these <w:noBreakHyphen/> codes with a real non-breaking hyphen, U+2011? Currently it looks like you just drop them.
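To make the suggestion concrete, the replacement could be sketched roughly like this in Python. This is only an illustration of the idea and not how the app builders actually import .docx files. The names replace_no_break_hyphens and fix_docx are made up for the example, and it assumes the namespace prefix is "w" and that document.xml is UTF-8, which is what Word normally writes.

```python
import re
import zipfile

# Assumption: the element is written with the usual "w:" prefix.
NO_BREAK_HYPHEN_TAG = re.compile(r"<w:noBreakHyphen\s*/>")
# A text element holding a real Unicode non-breaking hyphen (U+2011).
REAL_HYPHEN_TEXT = "<w:t>\u2011</w:t>"


def replace_no_break_hyphens(document_xml: str) -> str:
    """Swap each <w:noBreakHyphen/> for <w:t> containing U+2011."""
    return NO_BREAK_HYPHEN_TAG.sub(REAL_HYPHEN_TEXT, document_xml)


def fix_docx(src, dst):
    """Copy a .docx from src to dst, patching only word/document.xml.

    src and dst may be paths or file-like objects (hypothetical helper).
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8 encoded.
                text = data.decode("utf-8")
                data = replace_no_break_hyphens(text).encode("utf-8")
            zout.writestr(item, data)
```

Working on the raw XML text, rather than parsing and re-serializing it, leaves everything else in the file exactly as Word wrote it.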

Hello Jeff,

Thank you for the careful investigation into this issue. It makes it so much easier to address!

I have added support for recognizing <w:noBreakHyphen/> in DocX parsing and inserting a U+2011 character in its place. This will be in the next version after 4.5.