I disagree. For one thing, I think you are just picking the characters that are important to you. For example, you say to exclude U+0618 ARABIC SMALL FATHA, but what about U+0619 ARABIC SMALL DAMMA? It is an almost identical character, and it may not be important for you to ignore, but it could be important for the next guy. And I also disagree with adding more check boxes in the app. Your ‘Match vowel markings’ is really the same thing as the ‘Match accents and tones’ option that exists already. You just want to change it to fit your specific need.
So I propose something much simpler. I propose that when ’Match accents and tones’ is not checked (i.e. you want to ignore accents and tones), it automatically ignores everything that is added to a base character (which includes Arabic vowels and combining diacritics in the U+0300 block), as defined in the Unicode standard. These characters all have a General_Category in the Unicode Character Database of:
Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
FYI - the Unicode Character Database is found here:
And this is a description of how to read it:
What is currently ignored by the code if ‘Match accents and tones’ is not checked? Obviously apps currently created with the App Builders don’t ignore all of these characters marked
Mn. If they did, then we wouldn’t be having this discussion, because all of the Arabic vowels are marked
Mn, and could be easily ignored with the options currently available.
So I would propose that if the ‘Match accents and tones’ box is not selected, then app searches should ignore anything marked
Mn in the database. This should be easy to do programmatically. (Programmer notes: In Python, I use
if unicodedata.category(letter) == 'Mn'. And you would need to remove Mn characters from both the search word and text searched, before you do the comparison. I imagine ignoring all
Mn characters would actually make the code simpler than it currently is, since currently it ignores just some select subset.)
I don’t think it’s a problem for Arabic users to have to figure out that ignoring accents and tones includes vowels. But if you are insistent that the checkbox should say something about vowels, then we could add a field "Custom text for ‘Match accents and tones’ " - and you could put your vowel text there. But its function really is replacing accents and tones, and I think Arabic users should be able to figure that out.
If we follow this proposal to ignore all the
Mn characters, I think the field at the bottom needs to change. Right now it says, “Specify the replacements and diacritics to remove if not matching accents and tones.” For one thing, I think it should appear above the input buttons fields, right under the ‘Match accents and tones’ checkboxes. And I have trouble understanding the double negative. I think it should try to explain the combining marks first, then invite additional removals or replacements, something like this:
Ignoring accents and tones will automatically ignore nonspacing combining marks (as defined in the Unicode standard), things like combining diacritics and Arabic script combining vowels. Specify here (separated by spaces) any additional characters to ignore, or replacements to make in the search text. E.g. ‘é>e è>e’ would allow those two composed characters to be found with a simple ‘e’ in the search string. The string ‘_ -’ or ‘\u005F \u002D’ would ignore underscores and hyphens in the search. (Note that you can use \uXXXX to define a character by its Unicode hex code.)
I think making these changes will probably serve the needs of 99% of users. The cases that aren’t served are when a user doesn’t want to ignore certain characters marked
Mn, but does want to ignore others. Because with the above, if you choose to ignore, all of the
Mn characters will be ignored. I personally can’t think of a case when this would be needed. But I think 99% of those 1% of cases could be served by adding another simple field (below the field with the long explanation) that has this label:
Enter here characters to ignore or replacements to always use in the search, even when ‘Match accents and tones’ is selected.
I would think in this case you wouldn’t even want to show the ’Match accents and tones’ option on the Search page.
So that’s my proposal. I don’t think it would be very much work for the developers to implement (especially if we didn’t add the second field, but just kept the original field, changed its label and moved it up on the configuration page). And I think it makes more sense than making even more specific search criteria basically just for Arabic.
Previously we talked about always ignoring U+0640 ARABIC TATWEEL. I think that should still be done. But other characters would be ignored based on their character type, and whether the box is checked or not.