Let's make searching in SAB work better

I want to describe a problem I’m having with searching in SAB apps. Let’s take some sample words that would be in my text:
1 اَمی
2 امی
3 اَمَی
4 بنیامین

Right now, if I search using exactly word 1, I get no results. This happens even with the diacritic (U+064E) listed as a diacritic to remove for search in App Builder. This is bad. I should get some results matching at least word 1.

The way Microsoft Word works, at least for Arabic script, is that searching for word 1 will match word 1 and word 3. This is nice. It requires the diacritic to be there where I have typed it, but a word that has an extra diacritic elsewhere still matches.

Also, in Word, if I check the Match diacritics box, I only match word 1. That makes sense.

Lastly, in Word, if I search for word 2 exactly (Match diacritics box not checked), I match all 4 words, and with the Match diacritics box checked, I just match 2 and 4. Makes sense.

I would love for SAB to work the same way. Right now, if I search exactly for word 1 and turn on the Match accents and tones, it finds just word 1. That’s ok, but without turning on Match accents and tones as described above, it should return something and I propose it give me the same thing Word gives me. I don’t want to just have SAB remove the diacritic in the search word, because that would match all 4 words – matching too much.

(I’ve already communicated offline with Ron, so have some information on this issue already…)
Hi Ron,

It’s important to note first that you are talking about searching in SAB with the box “Match accents and tones” UNchecked. If the box is checked, then SAB will only match the sequence of letters and accents exactly, requiring any accents (or short vowels in Arabic, which can be thought of as combining diacritics) to be identical with the search term.

The reason you get no results if you search for exactly word 1 is a “sort-of” bug in SAB. I say “sort-of” because the app user left the box “Match accents and tones” UNchecked, meaning he wants to ignore accents and tones, and then he goes ahead and enters an accent (the Arabic short vowel) in his search term! If he wants to ignore accents, you would expect the search term to contain no accents, I would think. I believe the implementation of searches in SAB that do NOT match accents and tones actually uses a separate search database that has all of the accents and tones removed already. So if you send it a search term that has a diacritic, it of course will not find anything. To fix this “sort-of” bug, I would suggest that we pass the search term through the same process of removing accents and tones before it is compared against the search database. Since the separate search database is created at build time, not at app run time, I’m hoping there will still be a function available at run time to strip the accents and tones using the replacement specifications in the builder.
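A minimal sketch of that fix, written in Python purely for illustration (it is not SAB's actual code). The names `DIACRITICS_TO_REMOVE`, `strip_diacritics`, and `search_stripped_index` are hypothetical, `STRIPPED_INDEX` stands in for the search database built at build time, and the four sample words are typed as escapes on the assumption that the final letter is U+06CC.

```python
# Hypothetical stand-in for the builder's list of diacritics to remove for search.
DIACRITICS_TO_REMOVE = {"\u064E"}  # Arabic fatha

# The four sample words, written as escapes (assumes the yeh is U+06CC).
WORDS = [
    "\u0627\u064E\u0645\u06CC",                    # word 1
    "\u0627\u0645\u06CC",                          # word 2
    "\u0627\u064E\u0645\u064E\u06CC",              # word 3
    "\u0628\u0646\u06CC\u0627\u0645\u06CC\u0646",  # word 4
]

def strip_diacritics(text, diacritics=DIACRITICS_TO_REMOVE):
    """Return text with every configured diacritic removed."""
    return "".join(ch for ch in text if ch not in diacritics)

# Stand-in for the accent-free search database created at build time.
STRIPPED_INDEX = [(word, strip_diacritics(word)) for word in WORDS]

def search_stripped_index(term, strip_term):
    """Substring search against the stripped index, optionally stripping the term first."""
    needle = strip_diacritics(term) if strip_term else term
    return [word for word, stripped in STRIPPED_INDEX if needle in stripped]

print(len(search_stripped_index(WORDS[0], strip_term=False)))  # current behavior: 0 hits
print(len(search_stripped_index(WORDS[0], strip_term=True)))   # with the fix: 4 hits
```

With the term stripped, word 1 finds all four sample words, which is exactly the "matching too much" result described earlier in the thread, but at least the search no longer comes back empty.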

The function that Word uses for its search that you say is “nice” is actually quite complex. If what I said above is true about SAB building a completely separate search database with accents and tones removed, that would have to be completely scrapped. SAB would have to do its search on the full unmodified text search database. And then there would have to be a complex RegEx created to search for base characters with accents when the accent is specified in the search string, and base characters with optional accents when the accent isn’t specified in the search string.

ChatGPT and I came up with the following demo:

import re
import unicodedata

# Arabic short vowels and diacritics (harakat), written as a regex character-class range
ARABIC_DIACRITICS = "\u064B-\u0652"  # Fathatan to sukun

def decompose(text):
    """Decomposes text with Unicode NFKD and pairs each character with its combining class (0 for base letters)."""
    return [(c, unicodedata.combining(c)) for c in unicodedata.normalize("NFKD", text)]

def build_regex(search_string):
    """Constructs a regex pattern based on the search string:
    - If a letter has a diacritic, match it exactly.
    - If a letter has no diacritic, allow any diacritics (or none) in the text.
    """
    decomposed = decompose(search_string)
    pattern = ""

    for i, (char, combining_class) in enumerate(decomposed):
        if combining_class:
            # A diacritic typed in the search term must also be present in the text
            pattern += re.escape(char)
        else:
            # If it's a base letter, allow optional diacritics unless the search term supplies a diacritic next
            next_is_diacritic = (i + 1 < len(decomposed)) and decomposed[i + 1][1]
            if next_is_diacritic:
                pattern += re.escape(char)  # Exact match (diacritic will follow)
            else:
                pattern += re.escape(char) + f"[{ARABIC_DIACRITICS}]*"  # Allow any diacritic (or none)

    return pattern

def find_occurrences_arabic(search_string, text):
    """Finds all positions where search_string appears in text using the special matching rules.

    Positions are offsets into the NFKD-normalized text, which can differ from offsets in the original.
    """
    regex_pattern = build_regex(search_string)
    normalized_text = unicodedata.normalize("NFKD", text)  # Normalize text so it lines up with the pattern
    return [match.start() for match in re.finditer(regex_pattern, normalized_text)]

# Example usage
text = "هَذَا نَصٌّ تَجْرِيبِيٌّ. التَّجْرِيبُ مُهِمٌّ فِي النُّصُوصِ."
search_string = "نص"  # Should match with or without short vowels
print(find_occurrences_arabic(search_string, text))

search_string = "نَصٌّ"  # Should only match where the diacritics typed here are also present
print(find_occurrences_arabic(search_string, text))

This is just to give an idea of the approach that I think would be needed, which is: build a RegEx that matches a character that has an accent in the search term exactly, but allows any other base character to carry any combination of combining diacritics. This example only includes the Arabic short vowels as possible diacritics, but I assume a generic solution in SAB would have to include all of the possible combining diacritics (like those at U+03xx) in the expression.
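As a rough sketch of what a script-independent version might look like, the variant below treats any character in a Unicode mark category (Mn, Mc, Me) as a diacritic. The names `is_mark`, `build_generic_regex`, and `find_generic` are made up for illustration. Two assumptions are baked in: it normalizes with NFD rather than NFKD, and because Python's built-in `re` has no `\p{M}` class, it builds the optional character class from whatever marks actually occur in the text being searched.

```python
import re
import unicodedata

def is_mark(ch):
    """True for any Unicode combining mark (general category Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")

def build_generic_regex(search_string, text_marks):
    """Marks typed in the search term are required; a base character typed
    without a mark may be followed by any of the marks found in the text."""
    optional = ""
    if text_marks:
        optional = "[" + "".join(re.escape(m) for m in sorted(text_marks)) + "]*"
    chars = unicodedata.normalize("NFD", search_string)
    parts = []
    for i, ch in enumerate(chars):
        parts.append(re.escape(ch))
        next_is_mark = i + 1 < len(chars) and is_mark(chars[i + 1])
        if not is_mark(ch) and not next_is_mark:
            parts.append(optional)
    return "".join(parts)

def find_generic(search_string, text):
    """Return match offsets into the NFD-normalized text."""
    normalized = unicodedata.normalize("NFD", text)
    marks = {ch for ch in normalized if is_mark(ch)}
    pattern = build_generic_regex(search_string, marks)
    return [m.start() for m in re.finditer(pattern, normalized)]

# Roman script: NFD splits U+00E9 into "e" plus U+0301.
print(find_generic("cafe", "cafe caf\u00e9"))       # [0, 5]
print(find_generic("caf\u00e9", "cafe caf\u00e9"))  # [5]
```

Run against the four sample words (typed with U+06CC for the yeh), this finds words 1 and 3 for word 1 and all four words for word 2, which lines up with the Word behavior described at the top of the thread.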

I imagine this kind of search would be quite a lot slower than what we currently have, searching every base character for possible diacritics afterwards.

I think for the moment that we should just remove accents from the SAB search term, to fix the “sort-of” bug, which hopefully won’t be too difficult. And then we should have a hard think if we want to try to go further. (And also you say this is “nice” in Arabic, but what are the repercussions for searches in Roman script? Would we need to normalize all of the text to separate out all accents?)

Hope this is helpful for the discussion,
Jeff

Thanks Jeff. Very helpful.

One important point I want to make is that the user of the app may not know a lot about accents and tone marks, let alone know that leaving that switch off means accents and tones are ignored. Even I was confused when I first did a search for word 1, which has a vowel mark on the first letter, and came up with no hits. I didn’t understand why I didn’t get results.
We need to make it as easy as possible for users to find things.

One alternate approach would be to have the default be to match accents and tones and have a switch to ‘Ignore accents and tones’. Paratext searching uses this approach. I don’t like this as much as the Word approach, but at least I would find exactly what I was searching for if I put word 1 in.
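To make that alternative concrete, here is a rough sketch assuming plain substring matching. The function `search` and its `ignore_accents` switch are made-up names, and stripping every Unicode combining mark is only a stand-in for the list of accents and tones configured in the builder. The sample words are typed as escapes, assuming the yeh is U+06CC.

```python
import unicodedata

# The four sample words, written as escapes (assumes the yeh is U+06CC).
WORDS = [
    "\u0627\u064E\u0645\u06CC",                    # word 1
    "\u0627\u0645\u06CC",                          # word 2
    "\u0627\u064E\u0645\u064E\u06CC",              # word 3
    "\u0628\u0646\u06CC\u0627\u0645\u06CC\u0646",  # word 4
]

def strip_marks(text):
    """Remove all combining marks after NFD normalization (an assumed stand-in
    for the builder's configured list of accents and tones)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))

def search(term, words, ignore_accents=False):
    """Exact matching by default; the switch opts in to ignoring accents."""
    if ignore_accents:
        needle = strip_marks(term)
        return [w for w in words if needle in strip_marks(w)]
    return [w for w in words if term in w]

print(search(WORDS[0], WORDS) == [WORDS[0]])               # True: default finds word 1 only
print(len(search(WORDS[0], WORDS, ignore_accents=True)))   # 4: the switch widens the search
```

With this default, typing word 1 always finds word 1, and the wider match is something the user opts into.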