(I’ve already communicated offline with Ron, so I have some background on this issue already…)
Hi Ron,
It’s important to note first that you are talking about searching in SAB with the box “Match accents and tones” UNchecked. If the box is checked, then SAB will only match the sequence of letters and accents exactly, requiring any accents (or short vowels in Arabic, which can be thought of as combining diacritics) to be identical with the search term.
The reason you get no results when you search for exactly word 1 is a “sort-of” bug in SAB. I say “sort-of” because the app user left the box “Match accents and tones” UNchecked, meaning he wants to ignore accents and tones, and then went ahead and entered an accent (the Arabic short vowel) in his search term! If he wants to ignore accents, you would expect the search term to contain no accents. I believe SAB implements searches that do NOT match accents and tones with a separate search database that already has all of the accents and tones removed. So if you send it a search term that contains a diacritic, it will of course not find anything. To fix this “sort-of” bug, I suggest that we pass the search term through the same accent-and-tone removal before it is compared against the search database. Since the separate search database is created at build time, not at app run time, I’m hoping a function will still be available at run time to strip the accents and tones using the replacement specifications in the builder.
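To illustrate the suggested fix, here is a minimal Python sketch. It is not SAB code: `strip_marks` is a hypothetical stand-in for whatever accent-removal rules the builder actually applies, and `stripped_index` merely imitates the accent-free search database. It assumes that removing every Unicode combining mark is close enough to what the builder does.

```python
import unicodedata


def strip_marks(term: str) -> str:
    """Remove every combining mark from term (hypothetical stand-in for SAB's own rules)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


# Imitation of the accent-free search database created at build time:
# heh+fatha, thal+fatha, alef, space, noon+fatha, sad+dammatan+shadda
stripped_index = strip_marks("\u0647\u064E\u0630\u064E\u0627 \u0646\u064E\u0635\u064C\u0651")

# Search term typed WITH short vowels: noon+fatha, sad+dammatan+shadda
user_term = "\u0646\u064E\u0635\u064C\u0651"

found_raw = user_term in stripped_index                   # what seems to happen today
found_stripped = strip_marks(user_term) in stripped_index  # the proposed fix
print(found_raw, found_stripped)
```

The raw term cannot be found because the index no longer contains any marks, while the stripped term can.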
The function that Word uses for its search, which you say is “nice”, is actually quite complex. If what I said above is true about SAB building a completely separate search database with accents and tones removed, that approach would have to be scrapped completely: SAB would have to search the full, unmodified text instead. Then a complex RegEx would have to be built for each search, matching base characters together with their accents where the search string specifies an accent, and base characters with optional accents where it doesn’t.
ChatGPT and I came up with the following demo:
import re
import unicodedata

# Arabic short vowels and diacritics (harakat), written as a regex character-class range
ARABIC_DIACRITICS = "\u064B-\u0652"  # fathatan to sukun


def decompose(text):
    """Decompose Arabic text with Unicode NFKD and pair each character with its combining class (0 for base letters)."""
    return [(c, unicodedata.combining(c)) for c in unicodedata.normalize("NFKD", text)]


def build_regex(search_string):
    """Construct a regex pattern based on the search string:
    - If a letter has a diacritic, match it exactly.
    - If a letter has no diacritic, allow any diacritics (or none) in the text.
    """
    decomposed = decompose(search_string)
    pattern = ""
    for i, (char, is_diacritic) in enumerate(decomposed):
        if is_diacritic:
            # A diacritic typed in the search term must follow its base letter exactly
            pattern += re.escape(char)
        else:
            # A base letter allows optional diacritics unless the next character is a required diacritic
            next_is_diacritic = (i + 1 < len(decomposed)) and decomposed[i + 1][1]
            if next_is_diacritic:
                pattern += re.escape(char)  # Exact match (diacritic will follow)
            else:
                pattern += re.escape(char) + f"[{ARABIC_DIACRITICS}]*"  # Allow any diacritics (or none)
    return pattern


def find_occurrences_arabic(search_string, text):
    """Find all positions where search_string appears in text using the special matching rules."""
    regex_pattern = build_regex(search_string)
    # Normalize the text the same way as the search term; positions are offsets into the normalized text
    matches = re.finditer(regex_pattern, unicodedata.normalize("NFKD", text))
    return [match.start() for match in matches]


# Example usage
text = "هَذَا نَصٌّ تَجْرِيبِيٌّ. التَّجْرِيبُ مُهِمٌّ فِي النُّصُوصِ."

search_string = "نص"  # Should match with or without short vowels
print(find_occurrences_arabic(search_string, text))

search_string = "نَصٌّ"  # Should only match exactly with fatha, dammatan and shadda
print(find_occurrences_arabic(search_string, text))
This is just to give an idea of the approach that I think would be needed: build a RegEx that requires an exact match for any character that carries an accent in the search term, while allowing every other base character to carry any combination of combining diacritics. This example only includes the Arabic short vowels (U+064B to U+0652) as possible diacritics, but I assume a generic solution in SAB would have to include all of the possible combining diacritics (like those at U+0300 to U+036F) in the expression.
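As a rough idea of what a script-independent version might look like, here is a sketch that builds the "optional diacritics" character class from the Unicode database instead of hard-coding the Arabic range. The names `mark_class`, `build_generic_regex` and `find_generic` are made up for this illustration, and treating every code point in a mark category (Mn, Mc, Me) as a diacritic is my assumption, not something SAB defines.

```python
import re
import sys
import unicodedata
from functools import lru_cache


def is_mark(ch):
    """True for code points in a Unicode mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


@lru_cache(maxsize=1)
def mark_class():
    """Build (once) a regex character class covering every Unicode mark."""
    ranges, start = [], None
    for cp in range(sys.maxunicode + 2):  # one extra step closes a trailing range
        mark_here = cp <= sys.maxunicode and is_mark(chr(cp))
        if mark_here and start is None:
            start = cp
        elif not mark_here and start is not None:
            ranges.append((start, cp - 1))
            start = None
    parts = [f"\\U{a:08x}" if a == b else f"\\U{a:08x}-\\U{b:08x}" for a, b in ranges]
    return "[" + "".join(parts) + "]"


def build_generic_regex(search_string):
    """Marks typed in the search term are required; unmarked base characters accept any marks."""
    chars = unicodedata.normalize("NFD", search_string)
    pieces = []
    for i, ch in enumerate(chars):
        pieces.append(re.escape(ch))
        next_is_mark = i + 1 < len(chars) and is_mark(chars[i + 1])
        if not is_mark(ch) and not next_is_mark:
            pieces.append(mark_class() + "*")
    return "".join(pieces)


def find_generic(search_string, text):
    """Return match offsets into the NFD-normalized text."""
    normalized = unicodedata.normalize("NFD", text)
    return [m.start() for m in re.finditer(build_generic_regex(search_string), normalized)]


roman_text = "caf\u00e9 r\u00e9sum\u00e9 cafe"
print(find_generic("cafe", roman_text))       # unaccented term: accents optional
print(find_generic("caf\u00e9", roman_text))  # accented term: acute accent required
```

Building the character class walks the whole Unicode range once, which is why it is cached. A real implementation would presumably precompute it or rely on a regex engine with Unicode property support.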
I imagine this kind of search would be quite a lot slower than what we currently have, since the RegEx has to check for possible diacritics after every base character.
I think for the moment we should just remove accents from the SAB search term to fix the “sort-of” bug, which hopefully won’t be too difficult. Then we should think hard about whether we want to go further. (Also, you say this is “nice” in Arabic, but what are the repercussions for searches in Roman script? Would we need to normalize all of the text to separate out all accents?)
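On the Roman-script question, a quick Python check suggests the answer is yes, at least for a RegEx of the "base character plus optional combining marks" kind. A precomposed letter such as é (U+00E9) is a single code point, so the pattern cannot see the "e" inside it unless the text is decomposed first. The pattern below is hand-written for this illustration and only covers the U+0300 to U+036F block.

```python
import re
import unicodedata

precomposed = "caf\u00e9"  # 'é' stored as the single code point U+00E9
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' followed by U+0301

# Each base letter may be followed by any number of combining marks from U+0300-U+036F
marks = r"[\u0300-\u036f]*"
pattern = re.compile("c" + marks + "a" + marks + "f" + marks + "e" + marks)

match_precomposed = pattern.search(precomposed)  # no 'e' code point to match
match_decomposed = pattern.search(decomposed)    # 'e' + U+0301 is matched

print(match_precomposed is not None, match_decomposed is not None)
```

So either the searchable text would have to be stored decomposed, or it would have to be normalized at search time, which adds to the cost.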
Hope this is helpful for the discussion,
Jeff