ietf
[Top] [All Lists]

Re: BOMs

2013-11-18 11:40:22
Bjoern Hoehrmann writes:

Perl's JSON module gives me

  malformed JSON string, neither array, object, number, string
  or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")

Python's json module gives me

  ValueError: No JSON object could be decoded

Go's "encoding/json" module gives me

  invalid character 'ï' looking for beginning of value

I'm curious to know what level you're invoking the parser at.  As
implied by my previous post about the Python 'requests' package, it
handles application/json resources by stripping any initial BOM it
finds -- you can try this with

import requests
r=requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json";)
r.json()

Signatures are not part of the text of a document, as the UNICODE spec
makes clear, so asking what happens when you pass a string beginning
with a BOM to a parser is not really the right question in this
context, is it?

As I tried to say in an earlier post, there's a distinction which
needs to be carefully insisted on between, on the one hand, languages
and their parsers, where I agree signatures/BOMs have no place, and,
on the other hand, (media-typed) resources/entities/payloads and _their_
processing, where a discussion of BOMs/signatures _is_ appropriate
and, often, necessary.

BTW I agree that the status of the UTF-8 BOM as signature is slightly
hazy, but again the UNICODE spec itself [1] says

  "this sequence can serve as signature for UTF-8 encoded text where
   the character set is unmarked"

ht

[1] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: 
ht(_at_)inf(_dot_)ed(_dot_)ac(_dot_)uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]

<Prev in Thread] Current Thread [Next in Thread>