Search arabic text without tashkeel

Raja_Sand · February 16, 2019, 4:02am

To whom it may concern

Hi Dear,

I have created an Arabic scripture app whose text contains tashkeel.
When I search for a word without typing the tashkeel, the app finds nothing.

I want the app to ignore tashkeel when searching for a word.

Please work on this option for next release.

Humble regards.
Raja Sand
Pakistan

jeff_heath · March 2, 2019, 5:02pm

I think the developers would need more information about what characters you would like to be ignored in the search. From my experience it probably would be good to ignore the U+0640 ARABIC TATWEEL, used for kashida justification. But do you have other characters in mind as well? The developers would need information on specific code points, which can be found in the Unicode standard:

Raja_Sand · March 4, 2019, 7:22am

Dear,

Following is the list of Tashkeel/Harakat, that I want to get ignored while searching any word.

List of Harakat (tashkeel)

mcquayi · March 4, 2019, 12:03pm

Thanks Jeff that does make it clearer.

Raja, the PDF you point to lists many code points. Are you saying you want every one in the list to be ignored? From our point of view it would be clearer if you just listed the code points you want SAB to ignore.

Raja_Sand · March 4, 2019, 12:30pm

Dear Ian,
I am really happy to see that you are trying to resolve the issue. Thanks in anticipation.

Yes! I want all these characters to be ignored by SAB while searching any word.

Actually these are called harakat/tashkeel. They are not main characters; they are applied to other characters to change the meaning.

The same word with different harakat changes its meaning completely. So if someone does not know the exact word as it appears in the text with harakat, he or she can just type the plain word without any harakat, and should then find every occurrence of that word with its different harakat.

Hope now the question is clear.

mjpenny · March 4, 2019, 3:31pm

I am wondering whether one of the Search options (under Features) was designed to do something like this (except that it currently only specifies accents and tones). Perhaps it could be extended to other groups of characters that should be ignored when searching.

jeff_heath · March 5, 2019, 5:01pm

In my opinion, the tatweel should always be ignored. This character is simply used as a line filler, extending the baseline to make a word longer and better fill the line. Note that this character isn’t even mentioned in the PDF Raja pointed to, as it is a different kind of thing.

I think the characters in Raja’s PDF should not always be ignored, but I think adding the “Match accents and tones” option to the app is a great way to handle this issue. So Raja - you would need to add this option to your app configuration, as shown in mjpenny’s screen shot above. I would recommend checking both of the boxes that he circled in red. It won’t work yet for you, but once this change is made in SAB, the user can check this box on the search page in the app to ignore the harkat/tashkeel when he is doing a search.

And to the developers: I assume the tatweel (always) and the other characters (only if the option is checked) would be removed from the search text in addition to the text being searched? I could imagine a case where someone copies a word with tatweels and pastes it into the search box, but if you are removing tatweels from the text being searched, it wouldn’t be found unless you also removed the tatweels from the search text. Just a little sanity check…
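A minimal Python sketch of this sanity check, not SAB's actual code. The names `strip_tatweel` and `find_word` are hypothetical, and the sample word is only an illustration:

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

def strip_tatweel(s):
    """Remove every tatweel from s."""
    return s.replace(TATWEEL, "")

def find_word(query, text):
    """Hypothetical search: strip tatweel from BOTH sides, then compare."""
    return strip_tatweel(query) in strip_tatweel(text)

plain = "\u0643\u062A\u0627\u0628"            # kaf, teh, alef, beh
stretched = "\u0643\u0640\u062A\u0627\u0628"  # same word with one tatweel

# Stripping only the text being searched misses a pasted, stretched query:
print(stretched in strip_tatweel(plain))  # False
# Stripping both sides finds the word in either direction:
print(find_word(stretched, plain))        # True
print(find_word(plain, stretched))        # True
```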

Lorna · March 5, 2019, 5:30pm

All of the harakat characters in that document have an Mn category (Nonspacing_Mark) and that is the same category for other combining marks like acute and grave. So, I’d hope that turning off “Match accents and tones” might be the solution to the problem.

Also, I do know that in some cases tatweel/kashida is used as a mark holder (for example: tatweel plus combining hamza). In any case, you would like the combining mark to be ignored, so maybe it’s okay to ignore tatweel as well.
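The category claim is easy to check with Python's standard `unicodedata` module. This is only an illustrative sketch; the choice of sample characters is mine, not taken from Raja's PDF:

```python
import unicodedata

# Two harakat, two Latin combining accents, and the tatweel for comparison.
samples = ["\u064E", "\u0651", "\u0301", "\u0300", "\u0640"]

# Look up the Unicode General_Category of each sample.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

The harakat and the accents all report `Mn`, while the tatweel reports `Lm` (Modifier_Letter), so a rule based on `Mn` alone would not remove it.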

Raja_Sand · March 6, 2019, 1:35am

Agreed!

It would be better to add an option in the search options menu, either to ignore harakat/tashkeel or to search with harakat/tashkeel.

mcquayi · March 6, 2019, 11:37am

Raja,
I have reread your posts in TeamHelp and here.

You were asked the question “What are your settings for Features / Search?” That is, the settings that look like the picture below from SAB.

We need to know an answer to that question.
Have you tried different settings?
What differences do you see?

Raja_Sand · March 6, 2019, 11:55am

I have tried all these settings one by one, but all in vain.
After trying all possible options, I put my question in front of you.

mcquayi · March 6, 2019, 2:44pm

Thanks for that clarification.

joop · March 12, 2019, 6:01pm

Although we have not yet got to that, an APP is planned for an Arabic script language including vowels. This is in the future, but it is good I see this communication. I can not yet say anything about the FEATURES/SEARCH OPTIONS but I can say something about the vowels & signs ( tashkeel) that should be ignored for this upcoming APP. First I give you a link to a very nice and detailed overview of all Arabic Unicode characters (vowels, letters, digits, signs):

https://www.key-shortcut.com/en/writing-systems/ﺕﺏآ-arabic-alphabet/

On the right side of the page you can choose COMPACT or COMPLETE … both are very helpful.

When you choose COMPACT you will see the Unicodes.
In my case U+063B up to U+0657 should be ignored in the SEARCH section of SAB.

Everything that is named LETTER does not belong to TASHKEEL, same for DIGITS and LIGATURES. This for your information.

Raja_Sand · March 13, 2019, 1:48am

Nice to hear from you.

U+063B to U+064A are the main letters, these must be included in search option.

But the range, to be ignored, is from
U+0618, U+064B to U+0658, U+0670

Raja_Sand · May 26, 2019, 11:37am

Hi Dear
@mcquayi @jeff_heath @mjpenny @Lorna @joop

Please, is there any progress regarding the issue?

waiting for help…

joop · May 26, 2019, 12:56pm

Sorry, I agree with Raja about the range to be ignored:

U+0618, U+064B to U+0658, U+0670

I made a mistake in my message. Sorry!

expebition · June 24, 2019, 7:12pm

I propose the following new feature in SAB. Under Features > Search, we add 3 more options:

[ ] 'Match vowel markings' is selected by default
[ ] Show 'Match vowel markings' option on Search page
Vowel markings to ignore:
[empty text box]

Notes:

This could be useful for more than just Arabic, especially if the phrase ‘vowel markings’ as seen by the end user is configurable.
Alternative A: fill the ‘to ignore’ text box with this list by default: U+0618,U+064B,U+064C,U+064D,U+064E,U+064F,U+0650,U+0651,U+0652,U+0653,U+0654,U+0655,U+0656,U+0657,U+0658,U+0670
Alternative B: allow ranges, so the above list would be: U+0618, U+064B - U+0658, U+0670
Alternative C: use a regular expression rather than a list of characters (difficult to type and visualize), e.g. s/[\u0618\u064B-\u0658\u0670]//
All of the text should be converted to Unicode decomposed form before stripping.
Both the text being searched and the search field should be stripped of the ‘to ignore’ characters before comparison.
For more information on Arabic diacritics, see the Wikipedia page: Arabic_diacritics (I’m not allowed to post links).
In a different Arabic dialect, this is the proposed list of characters to strip: U+0618 - U+061A, U+064B - U+0652
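The notes above (a list with ranges, decomposition first, stripping on both sides) can be sketched in Python as follows. This is not SAB code: `parse_ignore_spec`, `strip_ignored` and `matches` are hypothetical names, the accepted syntax is just my reading of Alternative B, and NFD is assumed as the decomposed form:

```python
import re
import unicodedata

def parse_ignore_spec(spec):
    """Parse 'U+0618, U+064B - U+0658, U+0670' into a set of characters."""
    chars = set()
    for item in spec.split(","):
        points = [int(h, 16) for h in re.findall(r"U\+([0-9A-Fa-f]{4,6})", item)]
        if len(points) == 1:        # single code point
            chars.add(chr(points[0]))
        elif len(points) == 2:      # inclusive range
            chars.update(chr(cp) for cp in range(points[0], points[1] + 1))
        else:
            raise ValueError(f"cannot parse item: {item!r}")
    return chars

def strip_ignored(s, ignore):
    """Decompose to NFD, then drop every character in the ignore set."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if ch not in ignore)

def matches(query, text, ignore):
    """Strip BOTH the search field and the searched text before comparing."""
    return strip_ignored(query, ignore) in strip_ignored(text, ignore)

IGNORE = parse_ignore_spec("U+0618, U+064B - U+0658, U+0670")

vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"  # word with kasra and fatha
bare = "\u0643\u062A\u0627\u0628"                   # same word, no harakat

print(len(IGNORE))                       # 16
print(matches(bare, vocalized, IGNORE))  # True
```

One side effect of decomposing first: U+0622 ALEF WITH MADDA ABOVE splits into alef plus U+0653, so with this list it would match a plain alef.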

Would this satisfy everyone? Are there better ways of doing this?

Raja_Sand · June 25, 2019, 4:48am

Good Idea!!!
It will be a great feature and I think it is the resolution of the said issue…

Further, our developers know more and know it very well.

jeff_heath · June 25, 2019, 11:58am

I disagree. For one thing, I think you are just picking the characters that are important to you. For example, you say to exclude U+0618 ARABIC SMALL FATHA, but what about U+0619 ARABIC SMALL DAMMA? It is an almost identical character, and it may not be important for you to ignore, but it could be important for the next guy. And I also disagree with adding more check boxes in the app. Your ‘Match vowel markings’ is really the same thing as the ‘Match accents and tones’ option that exists already. You just want to change it to fit your specific need.

So I propose something much simpler. I propose that when ‘Match accents and tones’ is not checked (i.e. you want to ignore accents and tones), the app automatically ignores everything that is added to a base character (which includes Arabic vowels and combining diacritics in the U+0300 block), as defined in the Unicode standard. These characters all have a General_Category in the Unicode Character Database of:

Mn = Nonspacing_Mark: a nonspacing combining mark (zero advance width)

FYI - the Unicode Character Database is found here:
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
And this is a description of how to read it:
https://www.unicode.org/reports/tr44/

What is currently ignored by the code if ‘Match accents and tones’ is not checked? Obviously apps currently created with the App Builders don’t ignore all of these characters marked Mn. If they did, then we wouldn’t be having this discussion, because all of the Arabic vowels are marked Mn, and could be easily ignored with the options currently available.

So I would propose that if the ‘Match accents and tones’ box is not selected, then app searches should ignore anything marked Mn in the database. This should be easy to do programmatically. (Programmer notes: In Python, I use if unicodedata.category(letter) == 'Mn'. And you would need to remove Mn characters from both the search word and text searched, before you do the comparison. I imagine ignoring all Mn characters would actually make the code simpler than it currently is, since currently it ignores just some select subset.)
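To make the programmer note concrete, here is a minimal sketch of what I have in mind. It is not the SAB implementation; `normalize_for_search` and `search` are hypothetical names, and the NFD step is an assumption borrowed from the earlier proposal so that precomposed letters such as é are also covered:

```python
import unicodedata

TATWEEL = "\u0640"  # always ignored, as discussed earlier in the thread

def normalize_for_search(s, match_accents):
    """Prepare a string for comparison under the proposed rule."""
    s = s.replace(TATWEEL, "")
    if match_accents:
        return s  # box checked: keep every combining mark
    # Box not checked: decompose, then drop every Nonspacing_Mark (Mn).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def search(query, text, match_accents=False):
    """Apply the same normalization to the search word and the text searched."""
    return (normalize_for_search(query, match_accents)
            in normalize_for_search(text, match_accents))

vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"  # word with kasra and fatha
bare = "\u0643\u062A\u0627\u0628"                   # same word, no harakat

print(search(bare, vocalized))                          # True
print(search(bare, vocalized, match_accents=True))      # False
print(search("cafe", "caf\u00E9"))                      # True
print(search("cafe", "caf\u00E9", match_accents=True))  # False
```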

I don’t think it’s a problem for Arabic users to have to figure out that ignoring accents and tones includes vowels. But if you are insistent that the checkbox should say something about vowels, then we could add a field "Custom text for ‘Match accents and tones’ " - and you could put your vowel text there. But its function really is replacing accents and tones, and I think Arabic users should be able to figure that out.

If we follow this proposal to ignore all the Mn characters, I think the field at the bottom needs to change. Right now it says, “Specify the replacements and diacritics to remove if not matching accents and tones.” For one thing, I think it should appear above the input buttons fields, right under the ‘Match accents and tones’ checkboxes. And I have trouble understanding the double negative. I think it should try to explain the combining marks first, then invite additional removals or replacements, something like this:

Ignoring accents and tones will automatically ignore nonspacing combining marks (as defined in the Unicode standard), things like combining diacritics and Arabic script combining vowels. Specify here (separated by spaces) any additional characters to ignore, or replacements to make in the search text. E.g. ‘é>e è>e’ would allow those two composed characters to be found with a simple ‘e’ in the search string. The string ‘_ -’ or ‘\u005F \u002D’ would ignore underscores and hyphens in the search. (Note that you can use \uXXXX to define a character by its Unicode hex code.)
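As an illustration of how such a field could be interpreted, here is a short Python sketch. I do not know how SAB parses this field, so the rules here are assumptions: tokens are separated by spaces, a token containing ‘>’ is a replacement, any other token is a character to ignore, and \uXXXX is expanded first. The function names are hypothetical:

```python
import re

def decode_escapes(token):
    # Turn each backslash-u escape (4 hex digits) into the character it names.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), token)

def parse_rules(spec):
    """Map each source string to its replacement ('' means ignore it)."""
    rules = {}
    for token in spec.split():
        source, sep, target = decode_escapes(token).partition(">")
        if source:  # skip malformed tokens such as a bare '>'
            rules[source] = target if sep else ""
    return rules

def apply_rules(s, rules):
    """Apply every replacement or removal to s."""
    for source, target in rules.items():
        s = s.replace(source, target)
    return s

rules = parse_rules(r"é>e è>e \u005F \u002D")
print(apply_rules("très_bien-élevé", rules))  # tresbieneleve
```

The same `apply_rules` call would have to be made on both the search string and the text being searched.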

I think making these changes will probably serve the needs of 99% of users. The cases that aren’t served are when a user doesn’t want to ignore certain characters marked Mn, but does want to ignore others. Because with the above, if you choose to ignore, all of the Mn characters will be ignored. I personally can’t think of a case when this would be needed. But I think 99% of those 1% of cases could be served by adding another simple field (below the field with the long explanation) that has this label:

Enter here characters to ignore or replacements to always use in the search, even when ‘Match accents and tones’ is selected.

I would think in this case you wouldn’t even want to show the ‘Match accents and tones’ option on the Search page.

So that’s my proposal. I don’t think it would be very much work for the developers to implement (especially if we didn’t add the second field, but just kept the original field, changed its label and moved it up on the configuration page). And I think it makes more sense than making even more specific search criteria basically just for Arabic.

Previously we talked about always ignoring U+0640 ARABIC TATWEEL. I think that should still be done. But other characters would be ignored based on their character type, and whether the box is checked or not.

jeff_heath · June 25, 2019, 12:07pm

Just FYI - there are 96 Arabic characters that are marked Mn in the Unicode standard. The best way to find them is to look for the dotted circle “base” under combining characters on the 3 Arabic Unicode code pages (links below). But you can also search for “;Mn;” in the Unicode Character Database.
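If you would rather not scan the charts by eye, a few lines of Python can list the same characters. This is a sketch under assumptions: I take the three code pages to be Arabic, Arabic Supplement and Arabic Extended-A, and the count you get depends on the Unicode version bundled with your Python, so it may not be exactly 96:

```python
import unicodedata

# Assumed block ranges (inclusive); adjust if you mean different code pages.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
]

def nonspacing_marks(blocks):
    """Return every character in the given ranges whose category is Mn."""
    return [chr(cp)
            for start, end in blocks
            for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) == "Mn"]

marks = nonspacing_marks(ARABIC_BLOCKS)
print(len(marks))
for ch in marks[:5]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```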