Perhaps of interest to some who live in counties that have an upcoming
election, I’ve been exploring the use of XSLT 3.0 Streaming for the
processing of very large data sets produced by voting equipment. These data
sets are commonly referred to as “Cast Vote Records” (CVRs) and describe
the selections made on ballots, the number of votes those selections
represent, and their countability, among other things. NIST has released a
Common Data Format specification
<https://github.com/usnistgov/CastVoteRecords> for such records that can be
serialized as XML.
There has been some concern that the XML representation of this information
is simply too large to process effectively. To test that premise, I
developed a test deck generator and tabulator
capable of naïvely tabulating the contests. Both are written in XSLT 3.0. I
generated test decks of various sizes to get a better idea of the
scalability of different processing approaches.
The first approach is to load the entire CVR set into memory and operate on
it. The second approach is to “burst-mode” stream each CVR using XSLT 3.0
Streaming. I ran each transform 25 times for each set and averaged the
processing times.
(Chart: throughput for different input sizes, plotted against input size in
CVRs and in megabytes)
(Table: example processing times for typical jurisdiction sizes using
streaming)
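For those curious what the burst-mode pattern looks like, the core of it is
roughly the following sketch (not my actual tabulator; the CVR element name
is illustrative of the CDF structure):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of burst-mode streaming; element names are illustrative -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

  <!-- The unnamed mode is streamable: the processor never builds the
       whole document tree in memory -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <xsl:template match="CVR">
    <!-- copy-of() grounds one CVR as a small in-memory tree, which
         ordinary (non-streaming) templates can then navigate freely -->
    <xsl:apply-templates select="copy-of()" mode="tabulate"/>
  </xsl:template>

  <!-- Non-streaming mode that operates on the grounded copy -->
  <xsl:mode name="tabulate" on-no-match="shallow-skip"/>

</xsl:stylesheet>
```

Only one CVR is ever held in memory at a time, which is what keeps the
memory footprint flat regardless of input size.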
Each 100k CVRs required 4GB of memory to load, which makes the in-memory
approach inherently hardware-bound. I surmise that the lower throughput at
smaller input sizes is due to the startup cost of the XSLT processor.
Streaming processing took around 800MB of memory and remained stable at or
above the 100k CVR input size.
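That stable memory profile is what one would expect from streaming: running
totals can be kept in streamable accumulators whose rules are motionless.
Again a sketch with an illustrative element name, not the actual tabulator:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: counting CVR elements in constant memory with a
     streamable accumulator -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

  <xsl:mode streamable="yes"/>

  <xsl:accumulator name="cvr-count" initial-value="0" streamable="yes">
    <!-- Fires at each CVR start tag; the rule is motionless, so no
         part of the tree needs to be retained -->
    <xsl:accumulator-rule match="CVR" select="$value + 1"/>
  </xsl:accumulator>

  <xsl:template match="/">
    <!-- accumulator-after() consumes the stream and yields the final
         count once the whole document has been read -->
    <total><xsl:value-of select="accumulator-after('cvr-count')"/></total>
  </xsl:template>

</xsl:stylesheet>
```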
Large CVR datasets can be processed effectively, so long as the chosen
approach can scale.
Test Environment: Windows 10 x64, 32GB RAM, Intel i7-7500U, Saxon-EE 10.2J,
Java 1.8.0_151-1-redhat
Hilton Roscoe LLC
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com