Abel Braaksma wrote:
Jesper Tverskov wrote:
> It is impossible to come up with a REGEX that can handle any
> combination of upper case and lower case. What about PaulMcCartney or
> JFK? If pascal notation is not used, XxxxXxxxx, or a similar strict
> pattern, a REGEX solution is only possible if we know all input
> strings from the start.
All of the provided solutions work with any combination of upper case
and lower case. Which of the examples did you try?
PaulMcCartney would become Paul Mc Cartney with any of them.
Perhaps I misunderstood what you are implying (should Mc Cartney be
written McCartney? I didn't know). But if you mean that you want a list
of exceptions that do not need to be split into words, then you are
right: you'll need that list. We know little from the OP, so we are only
guessing here. For instance, is the string in one field, or is it part of
a larger string? Should consecutive capitals be ignored or not? Are there
exceptions? Can a string contain non-Latin characters, or punctuation? E.g.:
1. O'Reilly >>>> O'Reilly
2. McDonald's >>>> McDonald's
3. Paul McCartney >>>> Paul McCartney
4. J.K.Rowling >>>> J.K. Rowling (?)
5. JKRowling >>>> J K Rowling (?)
6. JFK >>>> JFK
7. BankOfUSA >>>> Bank Of USA
1, 5 and 6 go well with my last regex, using "\p{Lu}+".
For the rest, I think you need an exceptions list, which you can place
as alternates at the start of the regex (which may yield funny results
when the OP's text is from a larger corpus).
But all I'm doing is guessing on the requirements. Perhaps Babu will
enlighten us? ;)
Cheers
-- Abel Braaksma