Regular Expression (Regex) interactive tutorial site - GREAT!

NeilZubot · November 7, 2017, 11:14am

Just want to let the community know about a resource which can help one get a foothold on Regular Expression and use it in the ‘Changes’ tab of SAB, to powerful effect…

https://regexone.com

I didn’t initially know anything about Regular Expression but after a few hours on this site, I had enough understanding of it to create a regular expression to remove the introduction material from each book in an SAB project.

Find: \\imt[\W\w]*\\c\s1\b
Replace: \\c 1

Hope this helps some people.

mcquayi · November 7, 2017, 1:21pm

I have never used [\W\w], but may do now.
Thanks

NeilZubot · November 7, 2017, 1:46pm

There are probably better ways to write it…
it’s the only line I could get to work for me.

Friedo · November 13, 2017, 9:24am

Thanks a lot for sharing this. I had been looking for resources on Regex before… it’s great to have this kind of tutorial.

jeff_heath · November 14, 2017, 4:59pm

I just wanted to share a couple of Regex thoughts, so you can understand better the [\W\w]* expression mentioned in this thread. Here are the definitions of those two special sets:

\w: Matches any word character (a letter, digit, or underscore); for ASCII text this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any non-word character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].

So putting them together into a character class (surrounded by square brackets) is like saying:

[\W\w]: Match any alphanumeric OR non-alphanumeric character
[\W\w]*: Match any alphanumeric OR non-alphanumeric character, 0 or more times

What you are really trying to do is match any character. There is a Regex character already for that (almost)… it’s the period, “.” That character matches any character… BUT it does not match a newline character. So the following expression is simpler and similar (but not identical) to what you have above:

.*

The only difference between .* and [\W\w]* is that [\W\w]* matches newline characters as well. That is why it can go along, slurping up any character it sees until it gets to chapter 1. To do a similar thing with the period, you’d have to do something like this:

(.*\n)*

This finds (any number of characters followed by a newline), repeated 0 or more times. So in a sense the [\W\w]* may be a bit simpler, especially since line endings can differ between Windows and Linux. If \n doesn’t work, you could try \r\n or even \r as other possibilities. But my guess is that what I have above should work.
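To see the newline behaviour for yourself, here is a minimal sketch in Python. Python's re module and the sample text are assumptions for illustration only; the regex engine behind SAB's ‘Changes’ tab may differ in small details.

```python
import re

# Made-up sample: two lines of introduction followed by chapter 1.
text = "\\imt Introduction\n\\ip Some intro text\n\\c 1\n\\v 1 In the beginning"

# The period stops at the first newline, so .* only covers the rest of the \imt line.
dot_match = re.search(r"\\imt.*", text).group(0)

# [\W\w] matches every character, newlines included, so it runs to the end of the text.
any_match = re.search(r"\\imt[\W\w]*", text).group(0)

# (.*\n)* consumes whole lines, so it stops just after the last newline in the text.
lines_match = re.search(r"\\imt(.*\n)*", text).group(0)

print(repr(dot_match))    # only the first line
print(any_match == text)  # True: everything was matched
print(repr(lines_match))  # everything up to and including the "\c 1" line
```

None of these three has an ending condition yet, which is why the greedy versions run as far as they can.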

But there is a better way to define your ending condition. You said to look for \c followed by a space followed by a one and a word boundary (\b). Do you know why you had to add that word boundary at the end? It’s because \c\s1 on its own also matches the start of \c 10, \c 19, or even \c 150, and the star * (the zero-or-more quantifier) is “greedy” (as is the plus +, the one-or-more quantifier): it slurps up as much data as it can. So if you only specified \c\s1, it would suck up everything up to the last such chapter marker in the book.
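The effect of leaving out \b can be checked with a small Python sketch. The book text is made up, and Python's re module stands in for whatever engine SAB uses, so treat it as an illustration rather than a description of SAB itself.

```python
import re

# Made-up book text with chapters 1, 2 and 10.
book = (
    "\\imt Introduction\n\\ip Intro text\n"
    "\\c 1\n\\v 1 First\n"
    "\\c 2\n\\v 1 Second\n"
    "\\c 10\n\\v 1 Tenth\n"
)

# Without \b, "\c\s1" also matches the start of "\c 10",
# and the greedy star runs on to that last possible match.
without_boundary = re.sub(r"\\imt[\W\w]*\\c\s1", r"\\c 1", book)

# With \b, the "1" must be a complete number, so only "\c 1" qualifies.
with_boundary = re.sub(r"\\imt[\W\w]*\\c\s1\b", r"\\c 1", book)

print(repr(without_boundary))  # chapters 1 and 2 are gone, only chapter 10 is left
print(repr(with_boundary))     # only the introduction is removed
```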

But there is a better way. If you put a question mark after the star, it makes it just look for the first occurrence, i.e. makes it non-greedy. So if you just want it to find the first \c in the text and stop there, you could use this:

Find: \\imt[\W\w]*?\\c
Replace: \\c

But my instinct would be to use the other expression:

Find: \\imt(.*\n)*?\\c
Replace: \\c

That reads a little better to me: Find \imt followed by lines of characters (ending with newlines) until you find a \c. Note that this search is slightly different than the previous one: it will only find \c at the beginning of a line. That’s normally where it should be anyway, but it’s important to be able to understand small differences in these regular expressions.
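Here is a rough Python sketch of both non-greedy expressions, including the small difference just mentioned. The sample texts are invented, and Python's re module is assumed only as a convenient way to try the patterns.

```python
import re

# Made-up book: introduction, then chapters 1 and 2.
book = "\\imt Introduction\n\\ip Intro text\n\\c 1\n\\v 1 First\n\\c 2\n\\v 1 Second\n"

# On well-formed text both expressions stop at the first \c.
any_char = re.sub(r"\\imt[\W\w]*?\\c", r"\\c", book)
by_line = re.sub(r"\\imt(.*\n)*?\\c", r"\\c", book)

# Made-up edge case: the first \c is NOT at the beginning of a line.
odd = "\\imt Intro \\c 1\n\\v 1 First\n\\c 2\n"
odd_any = re.sub(r"\\imt[\W\w]*?\\c", r"\\c", odd)
odd_line = re.sub(r"\\imt(.*\n)*?\\c", r"\\c", odd)

print(any_char == by_line)  # True
print(repr(odd_any))        # stops at the mid-line \c, chapter 1 is kept
print(repr(odd_line))       # skips to the next line-initial \c, chapter 1 is lost
```

So the line-based expression is only safe when every \c really does start a line.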

Hopefully someone will find my ramblings to be helpful in some way…

NeilZubot · November 15, 2017, 7:46am

Hey, this is great! Thanks for sharing. I’ve learnt some things and like your expression better than mine - much cleaner. Thanks!

ChrisHubbard · November 28, 2017, 4:26pm

https://regex101.com is another good site to test and debug regular expressions. It has a nice interface for explaining what the regular expression is doing and colored matching of parts. It was recommended to me by one of the Paratext developers.

NeilZubot · November 29, 2017, 7:00am

Hey, that site looks awesome! Thanks!

dlear · July 27, 2021, 7:22am

Hello,

I need to enter (\xt…\xt* ) for all cross references. Is there a way to do that using regex?

Dler

GregAshleyCooper · July 27, 2021, 9:38am

@dlear Please expand on your request. Give an example of what you want changed and to what.

If you want to add \xt … \xt* to any reference in the text, you would need up to 66 regular expressions (more if you have references to Deuterocanonical books). SAB already does a fairly good job of linking text that looks like a reference based on your book names and abbreviations.

dlear · July 27, 2021, 9:58pm

Thank you for your help,
Yes, I need to add \xt… \xt * to any reference in the text. Is there an easy way to do that?
Could you help me please?

GregAshleyCooper · July 28, 2021, 7:14am

There is no easy way. You could set up a lot of changes in SAB, but I would suggest adding the \xt in Paratext so that they can be checked for errors there. Paratext has RegEx Pal:

Paratext 8

Paratext 9

You would need to account for variations of the book names.
(Some texts have full names in intros and abbreviations in footnotes. I have seen texts using different/longer abbreviations in intros/extra material.)

NOTE: Use this with caution. It could “break” your text. MAKE SURE YOU MAKE A BACKUP OF YOUR TEXT BEFORE PROCEEDING.

RegEx Pal:
Select your Project. Start with Tools > Find (Ctrl + F), then use Replace (Ctrl + H).

Example above:

((?:Ps|Psalms?) \d+(?::\d+)?(?: - \d+:\d+|-\d+)?(?:; \d+:\d+|,\d+)*)

\\xt \1\\xt*

You might have to account for periods after abbreviations e.g. Ps\.
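Outside RegEx Pal, the same find and replace can be tried in Python as a rough check. This is only a sketch: the sample sentence is made up, Python's re module may not behave exactly like RegEx Pal, and the variant that allows a period after the abbreviation is my own guess at how that adjustment could look.

```python
import re

# Find expression from the post (Psalms only).
find = r"((?:Ps|Psalms?) \d+(?::\d+)?(?: - \d+:\d+|-\d+)?(?:; \d+:\d+|,\d+)*)"

# Hypothetical variant that also accepts "Ps." with a period.
find_with_period = find.replace("(?:Ps|Psalms?)", r"(?:Ps\.?|Psalms?)")

# In a Python raw string, \\ in the replacement gives one backslash and \1 is group 1.
replace = r"\\xt \1\\xt*"

# Made-up sample sentences.
result = re.sub(find, replace, "See Ps 23:1-3,6; 24:1 and Psalms 119:105.")
result_period = re.sub(find_with_period, replace, "See Ps. 1:1-2.")

print(result)         # See \xt Ps 23:1-3,6; 24:1\xt* and \xt Psalms 119:105\xt*.
print(result_period)  # See \xt Ps. 1:1-2\xt*.
```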

You could add multiple book names in there, but as you can see in the example, the text already has \xt's surrounding references across multiple books in reference tags, so adding book specific xt's to this is not a good idea in this case. I am just using it to demonstrate RegEx Pal.

You will have to experiment. There are always exceptions to the structure/form that your regex will find, which will not be changed.

dlear · July 28, 2021, 9:56am

Dear Brother,
Thank you in advance,
In Kurdish we have no abbreviations for the names of the books.
But as you see in the screenshot, the verse number does not enter the link.

GregAshleyCooper · July 28, 2021, 10:59am

I’ve reworked the find expression a bit:

((?:xyz) \d+(?::\d+(?:-\d+)?| - \d+)?(,\d+(?:-\d+)?|; \d+(?::\d+(?:-\d+)?| - \d+)?)*)

(We use spaces in chapter ranges (and en-dash or em-dash instead of hyphens).
You will have to adjust accordingly.)
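To get a feel for what the reworked expression picks up, here is a small Python sketch. The book name "John" is a stand-in for the xyz placeholder and the sample sentence is invented; both are assumptions, and Python's re module may differ slightly from RegEx Pal.

```python
import re

# Reworked find expression from the post, with "xyz" swapped for a stand-in book name.
book_name = "John"  # hypothetical; use your own book names here
find = (
    r"((?:" + re.escape(book_name) + r") \d+(?::\d+(?:-\d+)?| - \d+)?"
    r"(,\d+(?:-\d+)?|; \d+(?::\d+(?:-\d+)?| - \d+)?)*)"
)

# Made-up sample sentence with a verse range, a verse list and a chapter range.
text = "Compare John 3:16-18,20; 4:1 and John 5 - 6 here."

# Group 1 is the whole reference; the inner group is only used for repetition.
references = [m.group(1) for m in re.finditer(find, text)]
print(references)  # ['John 3:16-18,20; 4:1', 'John 5 - 6']
```

Verse ranges after a comma or semicolon are now included in the match, so the verse numbers end up inside the \xt ... \xt* pair.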

This will find

If you want to have references from multiple books in \xt's (e.g. \xt Mt 19:4; Mk 10:6\xt*),
you might have to replace this afterwards

\\xt\*(; |,)\\xt

with

\1
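A quick Python sketch of this merge step, using the Mt/Mk example from above as input. Python's re module is an assumption here. The second pattern, which also consumes the space after the second \xt, is my own variation and not part of the post; without it the result keeps two spaces after the semicolon.

```python
import re

# Two separately marked references, as the earlier find/replace would leave them.
marked = "\\xt Mt 19:4\\xt*; \\xt Mk 10:6\\xt*"

# Find/replace pair from the post.
merged = re.sub(r"\\xt\*(; |,)\\xt", r"\1", marked)

# Variation (my assumption): also swallow the space that follows the second \xt.
merged_tidy = re.sub(r"\\xt\*(; |,)\\xt ", r"\1", marked)

print(repr(merged))       # one \xt pair, but with a double space after the semicolon
print(repr(merged_tidy))  # one \xt pair with single spacing
```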