5/05/2012

How to fix the "Could not decode a text frame as UTF-8." bug

Sometimes Google Chrome throws a "Could not decode a text frame as UTF-8." error. It happens when the server sends invalid Unicode characters (see Unicode surrogates) to the browser (via WebSockets or any other transport). I've found two workarounds for this issue.

The first one is, from my point of view, the best approach (the original code comes from the SockJS codebase). It removes all the invalid Unicode characters from the string, so you can send it from the server side without any further decoding.

/*
 * Fix the "Could not decode a text frame as UTF-8." bug #socket.io #nodejs #websocket
 *
 * Usage:
 *   cleanedString = filterUnicode(maybeHarmfulString);
 *
 * Original work-around from SockJS: https://github.com/sockjs/sockjs-node/commit/e0e7113f0f8bd8e5fea25e1eb2a8b1fe1413da2c
 * Other work-around: https://gist.github.com/2024272
 * 
 */

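// Characters to strip: control characters, surrogate code units, assorted
// invisible/formatting characters, and the Specials block (U+FFF0-U+FFFF).
var escapable = /[\x00-\x1f\ud800-\udfff\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufff0-\uffff]/g;

function filterUnicode(quoted){

  // The regex is global, so reset lastIndex before calling test().
  escapable.lastIndex = 0;
  // Fast path: nothing to strip, return the string untouched.
  if( !escapable.test(quoted)) return quoted;

  return quoted.replace( escapable, '');
}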
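To make the effect of the filter concrete, here is a small sketch that repeats the regular expression so it runs on its own. The input strings are made up for illustration. Note that, because the character class covers the whole surrogate range and the control characters, valid emoji pairs and newlines are removed as well, not just the lone surrogates.

```javascript
// Same character class as in the snippet above, repeated so this runs on its own.
var escapable = /[\x00-\x1f\ud800-\udfff\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufff0-\uffff]/g;

function filterUnicode(quoted) {
  return quoted.replace(escapable, '');
}

// Hypothetical inputs, chosen only for illustration.
var loneSurrogate = 'abc' + '\ud83d' + 'def'; // high surrogate with no low partner
var emoji = 'hi \ud83d\ude00';                 // valid surrogate pair (U+1F600)
var multiline = 'a\nb';                        // newline is in \x00-\x1f

console.log(JSON.stringify(filterUnicode(loneSurrogate))); // "abcdef"
console.log(JSON.stringify(filterUnicode(emoji)));         // "hi "
console.log(JSON.stringify(filterUnicode(multiline)));     // "ab"
```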

The second one takes another approach which seems valid (I only tested the former) but requires an extra decoding step on the other side:

/**
 * encode to handle invalid UTF
 * 
 * If Chrome tells you "Could not decode a text frame as UTF-8" when you try sending
 * data from nodejs, try using these functions to encode/decode your JSON objects.
 * 
 * see discussion here: http://code.google.com/p/v8/issues/detail?id=761#c8
 * see also, for browsers that don't have native JSON: https://github.com/douglascrockford/JSON-js
 * 
 * Any time you need to send data between client and server (or vice versa), encode before sending,
 * and decode upon receiving. This is useful, for example, if you are using socket.io for real-time
 * client/server communication of data fetched from a third-party service like Twitter, which might
 * contain Emoji, or other UTF characters outside the BMP.
 */
function strencode( data ) {
  return unescape( encodeURIComponent( JSON.stringify( data ) ) );
}

function strdecode( data ) {
  return JSON.parse( decodeURIComponent( escape ( data ) ) );
}
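As a rough sketch of how the two functions fit together, the round trip below encodes a made-up payload, checks that every code unit of the encoded string fits in one byte (so no surrogate is left in what goes over the wire), and decodes it back. The payload and variable names are only illustrative.

```javascript
function strencode(data) {
  return unescape(encodeURIComponent(JSON.stringify(data)));
}

function strdecode(data) {
  return JSON.parse(decodeURIComponent(escape(data)));
}

// Hypothetical payload containing an accented letter and an emoji (outside the BMP).
var payload = { user: 'caf\u00e9', text: 'hello \ud83d\ude00' };

// Server side: encode before sending.
var wire = strencode(payload);

// Every code unit of the encoded string is a single UTF-8 byte value.
var allBytes = true;
for (var i = 0; i < wire.length; i++) {
  if (wire.charCodeAt(i) > 0xff) allBytes = false;
}

// Client side: decode upon receiving.
var roundTripped = strdecode(wire);

console.log(allBytes);                           // true
console.log(roundTripped.text === payload.text); // true
```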

Hope this helps!

[Update] Dougal Campbell made some important notes: “the second method preserves the original data, while the first strips out information, altering the original data”. Thus, the first method can lead to potential security leaks (see his comment).
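One possible middle ground, which comes neither from SockJS nor from the gist, is to drop only the unpaired surrogates and leave valid pairs (emoji and other characters outside the BMP) alone. The sketch below assumes that unpaired surrogates are what triggers the error in your case. The name stripLoneSurrogates is hypothetical, and the function still alters the data, just less of it.

```javascript
// Match a valid high+low pair first, otherwise any single surrogate code unit.
var surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;

// Hypothetical helper: keeps valid surrogate pairs, removes unpaired surrogates.
function stripLoneSurrogates(str) {
  return str.replace(surrogates, function (match) {
    // A two-unit match is a well-formed pair; a one-unit match is a stray surrogate.
    return match.length === 2 ? match : '';
  });
}

// Made-up input: a valid emoji, then a stray high and a stray low surrogate.
var mixed = 'a\ud83d\ude00b' + '\ud83d' + 'c' + '\ude00';

console.log(stripLoneSurrogates(mixed) === 'a\ud83d\ude00bc'); // true
```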

5/01/2012

Unidecode for JavaScript (NodeJS)

Unidecode is a JavaScript port of the Perl module Text::Unidecode. It takes UTF-8 data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration -- i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system.

See Text::Unidecode for the original README file, including methodology and limitations.

Note that all the files named 'x??.php' in data are derived directly from the equivalent Perl file, and both sets of files are distributed under the Perl license, and not the BSD license.

Installation

$ npm install unidecode

Usage

$ node
> var unidecode = require('unidecode');
> unidecode("aéà)àçé");
'aea)ace'
> unidecode("に間違いがないか、再度確認してください。再読み込みしてください。");
'niJian Wei iganaika, Zai Du Que Ren sitekudasai. Zai Du miIp misitekudasai. '

node-unidecode on GitHub
