I just wanted to share a couple of Regex thoughts, so you can better understand the [\W\w]* expression mentioned in this thread. Here are the definitions of those two special sets:
\w: Matches any “word” character, i.e. a letter, digit, or underscore; in its basic ASCII form this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any character that is not a word character; this is equivalent to the class [^a-zA-Z0-9_].
So putting them together into a character class (surrounded by square brackets) is like saying:
[\W\w]: Match any alphanumeric OR non-alphanumeric character
[\W\w]*: Match any alphanumeric OR non-alphanumeric character, 0 or more times
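If you want to see this for yourself, here is a minimal sketch in Python. Python's re module is just my choice for illustration (it may not be the engine you are searching with), and the sample text is made up:

```python
import re

# Made-up sample text with letters, digits, punctuation and newlines.
sample = "\\id GEN\n\\imt Intro: 2 parts!\n\\c 1\n"

# \w picks out letters, digits and the underscore; \W picks out the rest.
word_chars = re.findall(r"\w", "a_1 -\n")
non_word_chars = re.findall(r"\W", "a_1 -\n")

# Together the two sets cover every character, newlines included,
# so [\W\w]* swallows the whole sample in a single match.
whole = re.match(r"[\W\w]*", sample).group(0)

print(word_chars)
print(non_word_chars)
print(whole == sample)
```

Note that the underscore lands in the \w list, and the newline lands in the \W list.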
What you are really trying to do is match any character. Regex already has a character for that (almost): the period, “.”. It matches any character EXCEPT a newline. So the following expression is simpler and similar (but not identical) to what you have above:
.*
The only difference between .* and [\W\w]* is that [\W\w]* matches newline characters as well, which is why it can go along, slurping up any character it sees until it gets to chapter 1. To do a similar thing with the period, you’d have to do something like this:
(.*\n)*
This finds (any number of characters followed by a newline), repeated 0 or more times. So in a sense [\W\w]* may be a bit simpler, especially since line endings can be a little funny between Windows, Linux, etc. If \n doesn’t work, you could try \r\n or even \r as other possibilities. But my guess is that what I have above should work.
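Here is a small sketch of the difference, again in Python with made-up text. One assumption to flag: in Python's re the period matches \r, so .*\n copes with Windows line endings as is; other engines may treat \r differently.

```python
import re

# Made-up two-line sample with Unix line endings.
text = "line one\nline two\n"

# The period stops at the first newline...
dot_only = re.match(r".*", text).group(0)

# ...while (.*\n)* keeps consuming whole lines, one at a time.
by_lines = re.match(r"(.*\n)*", text).group(0)

# [\W\w]* reaches the same place without mentioning line endings at all.
any_char = re.match(r"[\W\w]*", text).group(0)

# Same text with Windows line endings: in Python, "." also matches "\r",
# so each ".*\n" still consumes one full line.
windows = "line one\r\nline two\r\n"
by_lines_crlf = re.match(r"(.*\n)*", windows).group(0)

print(repr(dot_only))
print(by_lines == text, any_char == text, by_lines_crlf == windows)
```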
But there is a better way to define your ending condition. You said to look for \c followed by a space, followed by the digit 1 and a word-breaking character. Do you know why you had to add that word-breaking character at the end? It’s because the star * (zero or more wildcard) is “greedy” (as is the plus +, one or more wildcard) and slurps up as much data as it can. So if you have a book with \c 19, or even \c 150, and you only specified \c\s1, the match would run all the way to the last of those.
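To make the greedy behaviour concrete, here is a sketch in Python. The full expression from this thread is not reproduced here, so the pattern below (starting at \imt) and the sample text are my assumptions; note that the backslash of each marker has to be escaped as \\ in the pattern.

```python
import re

# Made-up sample with chapters 1, 19 and 150.
text = "\\imt Title\nintro\n\\c 1\nfirst\n\\c 19\nlater\n\\c 150\nend\n"

# Greedy star, no word boundary: the match runs to the LAST place
# where "\c 1" can be found, which is the start of "\c 150".
loose = re.search(r"\\imt[\W\w]*\\c\s1", text).group(0)

# With \b, "\c 19" and "\c 150" no longer qualify (a digit follows the 1),
# so the match can only end at chapter 1.
strict = re.search(r"\\imt[\W\w]*\\c\s1\b", text).group(0)

print(repr(loose))
print(repr(strict))
```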
You can also rein in the star itself. If you put a question mark after the star, it becomes non-greedy: it matches as few characters as possible, so it stops at the first occurrence of whatever comes next. So if you just want it to find the first \c in the text and stop there, you could use this:
But my instinct would be to use the other expression:
That reads a little better to me: find \imt followed by lines of characters (ending with newlines) until you find a \c. Note that this search is slightly different from the previous one: it will only find \c at the beginning of a line. That’s normally where it should be anyway, but it’s important to be able to understand small differences in these regular expressions.
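Here is a sketch of that difference in Python. The two patterns are my own guess at what the non-greedy versions look like, based on the descriptions above, and the sample text is made up (I deliberately put a stray \c in the middle of a line):

```python
import re

# Made-up sample: one "\c" appears mid-line, inside the intro text.
text = "\\imt Title\nsee \\c markers\nmore intro\n\\c 1\nverse\n\\c 2\n"

# Non-greedy any-character form: stops at the first "\c" anywhere.
anywhere = re.search(r"\\imt[\W\w]*?\\c", text).group(0)

# Non-greedy line-by-line form: each step consumes the rest of a line,
# so "\c" can only be found right after a newline (at the start of a line).
line_start = re.search(r"\\imt(.*\n)*?\\c", text).group(0)

print(repr(anywhere))
print(repr(line_start))
```

The first search stops at the stray mid-line \c, while the second skips it and stops at the \c that begins the chapter 1 line.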
Hopefully someone will find my ramblings to be helpful in some way…