Re: [xsl] is there a way to hash an element?

2016-06-10 02:59:45
Note that if these serializations end up being very long and you want
to reduce each one to a small signature (comparable to a typical hash),
you can use the string-to-codepoints() function to turn any string into
a sequence of integers that can feed a roll-your-own hashing function.
Since you are only interested in checking that two descendant subtrees
are identical, and are not concerned with security, a very simple
compaction function would work fine. For example, you could create a
user-defined function that takes any sequence of integers and returns
the string X-Y, where X is the length of the sequence and Y is the
remainder of sum($seq ! (position() * .)) upon division by a suitably
large number (an extension of the typical UPC checksum algorithm).
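
The compaction described above can be sketched as follows. Python is
used here only so the arithmetic can be run directly; ord() stands in
for string-to-codepoints(), and the function name `signature` and the
modulus 2**61 - 1 are assumptions made for illustration, not anything
prescribed in this thread.

```python
def signature(text: str, modulus: int = 2**61 - 1) -> str:
    """Return 'X-Y': X is the codepoint count, Y a position-weighted checksum.

    The name and the default modulus are illustrative assumptions.
    """
    # Analogue of string-to-codepoints(): one integer per character.
    codepoints = [ord(ch) for ch in text]
    # Weight each codepoint by its 1-based position, mirroring
    # sum($seq ! (position() * .)) in XPath.
    weighted = sum(pos * cp for pos, cp in enumerate(codepoints, start=1))
    return f"{len(codepoints)}-{weighted % modulus}"


if __name__ == "__main__":
    print(signature("<a>1</a>"))
    # Different strings can share a signature, so this is a filter, not a proof.
    print(signature("ac") == signature("cb"))
```

Because this is a checksum rather than a cryptographic hash, distinct
strings can collide ("ac" and "cb" both give 2-295), so it is probably
safest to treat equal signatures as candidates and compare the full
serializations within each group.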

On Fri, Jun 10, 2016 at 12:51 AM, Dimitre Novatchev
dnovatchev(_at_)gmail(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
You may not even need a hash function.

Just use the standard XPath 3.0 function:

  serialize()


http://www.w3.org/TR/xpath-functions-30/#func-serialize
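
As a rough illustration of using a serialization as the grouping key,
here is a minimal sketch in Python rather than XPath. It swaps in
xml.etree.ElementTree.canonicalize() (Python 3.8+) for serialize(), and
the element and attribute names (messages, msg, act, id) are
hypothetical sample data, not taken from the original question.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def descendant_key(elem):
    """Canonical serialization of an element's content, ignoring its own attributes."""
    # Re-parent the children under an attribute-free wrapper so that only
    # the descendants (and their attributes) contribute to the key.
    wrapper = ET.Element("wrapper")
    wrapper.text = elem.text
    wrapper.extend(list(elem))
    # canonicalize() sorts attributes, so attribute order does not matter.
    return ET.canonicalize(ET.tostring(wrapper, encoding="unicode"))


def group_by_descendants(root):
    """Group the ids of root's children by their descendant serialization."""
    groups = defaultdict(list)
    for msg in root:
        groups[descendant_key(msg)].append(msg.get("id"))
    return list(groups.values())


# Hypothetical sample data: messages 1 and 2 differ only in attribute order.
doc = ET.fromstring(
    '<messages>'
    '<msg id="1"><act code="A" unit="mg">5</act></msg>'
    '<msg id="2"><act unit="mg" code="A">5</act></msg>'
    '<msg id="3"><act code="B" unit="mg">5</act></msg>'
    '</messages>'
)

if __name__ == "__main__":
    print(group_by_descendants(doc))
```

In XSLT the same idea would presumably be a grouping key built from
serialize() applied to the child nodes; whitespace-only differences
between otherwise identical subtrees would still produce different keys
unless they are normalized first.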


Cheers,
Dimitre

On Thu, Jun 9, 2016 at 3:08 PM, Graydon graydon(_at_)marost(_dot_)ca
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
Hello all --

So I've got about half a gigabyte of XML messages describing various
health care actions.  Many of these are structural duplicates of each
other; the top elements differ by their attribute values, but the
structure and values of the descendant elements are the same.  The amount
of duplication varies from none to thousands.

I've got an apparently useful heuristic based on descendant attribute
values, but would -- it is health care data -- really like to have a
more robust way to group the elements into sets of equivalent top-level
names by their structural sameness.  (I can't hand-check the whole data
set.)

So I find myself wanting an equivalent of sha256sum for elements so I
could generate a grouping key from the descendant elements and their
associated attributes as a unit.

Is there such a thing?  Equivalent approaches?

Thanks!
Graydon








-- 

"A false conclusion, once arrived at and widely accepted is not
dislodged easily, and the less it is understood, the more tenaciously
it is held." - Cantor's Law of Preservation of Ignorance.