Re: How to avoid auto-linking in non-ascii URLs

2006-03-23 05:39:54
Hi, thank you for your quick reply.

In <200603221712(_dot_)k2MHC9230101(_at_)gator(_dot_)earlhood(_dot_)com>,
 earl(_at_)earlhood(_dot_)com wrote:
On March 23, 2006 at 01:36, Masao Takaku wrote:

MHonArc automatically turns URL-like strings into links.
When a message includes a string "See",
MHonArc processes it as follows:

See <a 

It works well, but when a URL-like string is followed by non-ASCII
text without a space, this feature is not useful;
e.g. "を見て.", which means
"See" in Japanese, comes out as follows:


In this example, the output should look like the following:


My environment is Perl-5.8.0 and MHonArc-2.6.15 (default setting).

Does anyone know how to do this, or any workarounds?

First, you may want to check out <> for
Japanese-specific usage information on MHonArc.  There should also
be links to a Japanese-based mailing list which may be useful.

<>, the rcfile for
ISO-2022-JP encoding, is a good resource and works fine.
With the resource settings based on ISO-2022-JP, URL linking is
cut off only at non-ASCII text. This seems to be a workaround for my
As for your specific problem, you may need to disable URL linking.
This can be done by specifying -nourl on the command line or
<NOURL> in your resource file.  The '&' is a legal URL character,
and MHonArc does not try to interpret what character entity reference
values resolve to in order to determine whether they should be included.
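
To make the behaviour described above concrete, here is a minimal Python
sketch. It is not MHonArc's code (MHonArc is written in Perl and its real
pattern is not quoted in this thread); the NAIVE_URL pattern, the autolink
helper and the example.com address are all assumptions for illustration.
It shows how a pattern that accepts '&', '#' and ';' as URL characters
swallows a numeric character reference that directly follows a URL.

```python
import re

# Hypothetical, simplified URL pattern; NOT MHonArc's real regex.
# '&', '#' and ';' are accepted because they are legal URL characters.
NAIVE_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def autolink(html_text: str) -> str:
    """Wrap every URL-like match in an <a> element."""
    return NAIVE_URL.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', html_text
    )


# U+3092 has already been converted to a numeric character reference,
# and no space separates it from the (assumed) URL.
sample = "http://example.com/&#x3092;"
linked = autolink(sample)
print(linked)
```

Every character of "&#x3092;" is in the accepted set, so the whole
reference ends up inside both the href and the link text.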

No... disabling URL linking is not what I want.
# URL linking mostly works, except for non-ASCII URLs.


It's true that '&' is a legal URL character, but "U+3092" is an
invalid character in a URL, and the numeric entity "&#x3092;" is
equivalent to "U+3092" in HTML. Also, how non-ASCII URLs are interpreted,
at least in Japanese encodings, depends heavily on the browser/server

Is this assumption also true in other languages/encodings?

If so, I think that MHonArc, even in default settings, should treat
these characters as invalid URL characters in URL linking code.
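
As a rough idea of what that could look like, here is a small Python
sketch. It is only an illustration under my own assumptions, not a patch
for MHonArc (which is Perl): the URL and NUM_REF patterns, the url_end and
autolink helpers and the example.com address are all hypothetical. The
idea is to end the link just before the first numeric character reference
whose code point is above U+007F.

```python
import re

# Hypothetical, simplified URL pattern; not MHonArc's real regex.
URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")
# Numeric character reference, hexadecimal (group 1) or decimal (group 2).
NUM_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def url_end(candidate: str) -> int:
    """Index just before the first reference to a non-ASCII code point."""
    for ref in NUM_REF.finditer(candidate):
        code = int(ref.group(1), 16) if ref.group(1) else int(ref.group(2))
        if code > 0x7F:
            return ref.start()
    return len(candidate)


def autolink(html_text: str) -> str:
    """Link URLs, leaving trailing non-ASCII references outside the link."""
    def repl(m: re.Match) -> str:
        text = m.group(0)
        end = url_end(text)
        url, rest = text[:end], text[end:]
        return f'<a href="{url}">{url}</a>{rest}'
    return URL.sub(repl, html_text)


# "を見て." as numeric character references, directly after the URL.
print(autolink("See http://example.com/&#x3092;&#x898B;&#x3066;."))
```

References to ASCII characters such as "&#38;" are left inside the URL in
this sketch; whether that is the right choice for every encoding is an
open question.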

The URL linking code is a single regex operation.

I'm not sure at this time what code changes could be made.
If you go with ISO-2022-JP encoding for your archives, it may
avoid this problem.

Masao Takaku  //  masao(_at_)nii(_dot_)ac(_dot_)jp