Customizing Encoding for Web Services

Internally, we have a best practice of using ISO-8859-1 as our XML encoding because of its greater flexibility with special characters. Up to this point, we’ve written LotusScript or Java agents that generate XML via the print statement, which let us simply choose the encoding when we wrote the document header.

We are now venturing into creating official Lotus Web Service providers, but I noticed that both the WSDL and the SOAP responses are encoded as UTF-8. Unlike other design elements (JavaScript libraries, images, stylesheets, files, et cetera), the web service design element properties offer no option to specify a character set.

I’ve tried updating the web site rules to change the response headers for ?wsdl and ?openWebService but with no luck.

Any ideas on how to do this? UTF-8 is too limiting because our users are always slipping special characters into our content, and (depending on the browser) this results in either a graceful failure or a very nasty one.

TIA,

s

Subject: Customizing Encoding for Web Services

UTF-8 is not limiting! It is an encoding for the full Unicode character set. UTF-8 uses 1 byte to represent any of the 128 characters in the strict ASCII subset; when the high bit of a byte is set, it switches to a two-, three-, or four-byte sequence for that character. It uses two bytes for 1,920 additional accented, non-Latin, and special characters; three bytes for 61,440 additional characters from various countries, cultures, and languages; and four bytes for over 1,000,000 more potential characters.
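If it helps, those byte counts are easy to verify for yourself. A minimal sketch in plain Java (nothing Domino-specific; the sample characters are just arbitrary picks from each byte-length band):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // One sample from each UTF-8 byte-length band:
        // ASCII, Latin-1 supplement, rest of the BMP, supplementary plane.
        String[] samples = { "A", "\u00AE", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            int bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(String.format("U+%04X -> %d byte(s)",
                    s.codePointAt(0), bytes));
        }
    }
}
```

Running it shows U+0041 (A) taking one byte, U+00AE (the Registered mark) two, U+20AC (the Euro sign) three, and U+1F600 four.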

ISO-8859-1 is the limiting choice. It is, however, a bit more space-efficient if all you care about is the Latin-1 set of characters, since it encodes every one of them in a single byte, whereas UTF-8 needs two bytes for any character at 0x80 or above.

Subject: RE: Customizing Encoding for Web Services

Unfortunately, I speak from experience. Our users will regularly paste a special character into a text field (e.g., a “Registered” trademark symbol, ®). When that character is inserted into an XML document and passed to a browser, you will encounter issues. Firefox degrades gracefully; IE does not. If the XML is being used within an AJAX operation, IE’s XML parser will not recognize the response as valid XML and parsing fails. If, however, that identical XML response is instead marked as ISO-8859-1, then both IE and Firefox handle the character properly.

So, back to my original question: For those of us who do not want to use UTF-8 to encode their Web Service WSDL and SOAP responses, how can we configure Domino to cooperate?

Subject: RE: Customizing Encoding for Web Services

Well… UTF-8 is the default character encoding for XML, so it’s very hard to believe that the IE parser doesn’t recognize a properly constructed UTF-8 stream, but for all I know, you could be right. It’s been a while since I played with this stuff seriously. Anyhow, I’ve not played with WSDL and SOAP at all, so I can’t answer your immediate concern.
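One way to see why “properly constructed” matters: ® is a single byte (0xAE) in ISO-8859-1 but a two-byte sequence (0xC2 0xAE) in UTF-8, so Latin-1 bytes served under a UTF-8 declaration are a malformed stream, which would explain a strict parser rejecting the document even though the character itself is legal. A small sketch in plain Java (my own illustration, not anything from Domino):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingMismatch {
    public static void main(String[] args) {
        String s = "\u00AE"; // U+00AE, the Registered trademark symbol

        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // one byte: 0xAE
        byte[] utf8   = s.getBytes(StandardCharsets.UTF_8);      // two bytes: 0xC2 0xAE
        System.out.println("ISO-8859-1 bytes: " + Arrays.toString(latin1));
        System.out.println("UTF-8 bytes:      " + Arrays.toString(utf8));

        // A lone 0xAE byte in a stream that claims to be UTF-8 is an
        // invalid sequence; Java decodes it to U+FFFD (the replacement
        // character), and strict XML parsers simply reject the document.
        String misread = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("0xAE decoded as UTF-8: " + misread);
    }
}
```

So if the agent is printing Latin-1 bytes while the header (or the platform default) declares UTF-8, that mismatch alone would produce exactly the failure described above.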

Subject: RE: Customizing Encoding for Web Services

“This is when I discovered that not all valid UTF8 characters are valid XML characters” → http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html

A character can be valid as UTF-8, but not a valid XML character, so . . .

Anybody else out there have a suggestion? I guess I could have every single web service I use test for and strip offending characters, but it seems to make more sense to simply “upgrade” to ISO-8859-1.
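For what it’s worth, the test/strip option is smaller than it sounds: the XML 1.0 spec defines the legal character ranges explicitly, so one shared helper covers every service. A sketch in plain Java (the class and method names are my own invention, not part of any Domino API):

```java
public class XmlSanitizer {

    // Drops code points that the XML 1.0 Char production forbids:
    // most C0 control characters, surrogates, and U+FFFE/U+FFFF.
    public static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints()
          .filter(XmlSanitizer::isValidXmlChar)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Legal XML 1.0 characters: #x9 | #xA | #xD | [#x20-#xD7FF]
    //                         | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // 0x1A is the control character from the linked blog post;
        // the Registered mark is a graphic character and survives.
        String dirty = "price\u001A list \u00AE";
        System.out.println(XmlSanitizer.stripInvalidXmlChars(dirty));
    }
}
```

The ® comes through untouched while the 0x1A control character is removed, so the filter addresses the linked blog post’s problem without abandoning UTF-8.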

Subject: RE: Customizing Encoding for Web Services

The example he gave, 0x1A, is a control character, not a graphic character, and most control characters are illegal in XML regardless of the character encoding. But the example you gave, the Registered mark, is a graphic character. It is U+00AE in Unicode (0x00AE in UTF-16) and the two-byte sequence 0xC2 0xAE in UTF-8, and it is definitely legal in XML. Every graphic character representable in ISO-8859-1 is in fact representable in UTF-8 and legal in XML according to the standards. But while I’ve used the IE parser quite a lot, I can’t be sure that it really does process every legal UTF-8 character correctly, so I guess you gotta do whatever you gotta do in order to make it work.