I disagree. For one thing, I think you are just picking the characters that are important to you. For example, you say to exclude U+0618 ARABIC SMALL FATHA, but what about U+0619 ARABIC SMALL DAMMA? It is an almost identical character, and it may not be important for you to ignore, but it could be important for the next guy. And I also disagree with adding more check boxes in the app. Your âMatch vowel markingsâ is really the same thing as the âMatch accents and tonesâ option that exists already. You just want to change it to fit your specific need.
So I propose something much simpler. I propose that when âMatch accents and tonesâ is not checked (i.e. you want to ignore accents and tones), it automatically ignores everything that is added to a base character (which includes Arabic vowels and combining diacritics in the U+0300 block), as defined in the Unicode standard. These characters all have a General_Category in the Unicode Character Database of:
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
FYI - the Unicode Character Database is found here:
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
And this is a description of how to read it:
https://www.unicode.org/reports/tr44/
What is currently ignored by the code if âMatch accents and tonesâ is not checked? Obviously apps currently created with the App Builders donât ignore all of these characters marked Mn
. If they did, then we wouldnât be having this discussion, because all of the Arabic vowels are marked Mn
, and could be easily ignored with the options currently available.
So I would propose that if the âMatch accents and tonesâ box is not selected, then app searches should ignore anything marked Mn
in the database. This should be easy to do programmatically. (Programmer notes: In Python, I use if unicodedata.category(letter) == 'Mn'
. And you would need to remove Mn characters from both the search word and text searched, before you do the comparison. I imagine ignoring all Mn
characters would actually make the code simpler than it currently is, since currently it ignores just some select subset.)
I donât think itâs a problem for Arabic users to have to figure out that ignoring accents and tones includes vowels. But if you are insistent that the checkbox should say something about vowels, then we could add a field "Custom text for âMatch accents and tonesâ " - and you could put your vowel text there. But its function really is replacing accents and tones, and I think Arabic users should be able to figure that out.
If we follow this proposal to ignore all the Mn
characters, I think the field at the bottom needs to change. Right now it says, âSpecify the replacements and diacritics to remove if not matching accents and tones.â For one thing, I think it should appear above the input buttons fields, right under the âMatch accents and tonesâ checkboxes. And I have trouble understanding the double negative. I think it should try to explain the combining marks first, then invite additional removals or replacements, something like this:
Ignoring accents and tones will automatically ignore nonspacing combining marks (as defined in the Unicode standard), things like combining diacritics and Arabic script combining vowels. Specify here (separated by spaces) any additional characters to ignore, or replacements to make in the search text. E.g. âĂ©>e Ăš>eâ would allow those two composed characters to be found with a simple âeâ in the search string. The string â_ -â or â\u005F \u002Dâ would ignore underscores and hyphens in the search. (Note that you can use \uXXXX to define a character by its Unicode hex code.)
I think making these changes will probably serve the needs of 99% of users. The cases that arenât served are when a user doesnât want to ignore certain characters marked Mn
, but does want to ignore others. Because with the above, if you choose to ignore, all of the Mn
characters will be ignored. I personally canât think of a case when this would be needed. But I think 99% of those 1% of cases could be served by adding another simple field (below the field with the long explanation) that has this label:
Enter here characters to ignore or replacements to always use in the search, even when âMatch accents and tonesâ is selected.
I would think in this case you wouldnât even want to show the âMatch accents and tonesâ option on the Search page.
So thatâs my proposal. I donât think it would be very much work for the developers to implement (especially if we didnât add the second field, but just kept the original field, changed its label and moved it up on the configuration page). And I think it makes more sense than making even more specific search criteria basically just for Arabic.
Previously we talked about always ignoring U+0640 ARABIC TATWEEL. I think that should still be done. But other characters would be ignored based on their character type, and whether the box is checked or not.