This document defines how a MonkeyScript interpreter will handle character encodings within script files.
Preemptive notes
Inside MonkeyScript, the interpreter first executes a core monkeyscript.js file, which handles the actual execution of a program's script files. This file is guaranteed to be pure US-ASCII and to contain no invalid JS at the start of the file.
SpiderMonkey has a slight bug. If you save the following file as UTF-8 and try to execute it:
// These two are the same character
print("♥" === "\u2665");
print("♥".length);
print("\u2665".length);
Then SpiderMonkey (by default) will print false, 3, and 1 (since the heart is a 3-byte character in UTF-8), while other engines (Rhino, V8, and JavaScriptCore) will all print true, 1, and 1.
For the MonkeyScript environment this should be fixed so that strings read from files are decoded according to the file's encoding and translated into proper UTF-16.
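The symptom can be reproduced outside SpiderMonkey. The sketch below is only an illustration and assumes a runtime that provides the standard TextDecoder (for example Node.js), which is not part of MonkeyScript. It compares treating each byte of the file as one character with decoding the same bytes as UTF-8.

```javascript
// UTF-8 bytes of U+2665 (the heart), as they appear in the saved file.
const heartBytes = Uint8Array.from([0xE2, 0x99, 0xA5]);

// Treating every byte as its own character reproduces the faulty result.
const bytePerChar = String.fromCharCode(...heartBytes);

// Decoding the bytes as UTF-8 yields the single UTF-16 code unit expected.
const decoded = new TextDecoder("utf-8").decode(heartBytes);

console.log(bytePerChar === "\u2665", bytePerChar.length); // false 3
console.log(decoded === "\u2665", decoded.length);         // true 1
```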
Interpreting source code
MonkeyScript uses Python's PEP 263 as a reference, along with the suggestion made by ServerJS members to default to UTF-8, which is also the de facto default in other formats such as XML.
A new error type, EncodingError, is defined. monkeyscript.js should die and print a proper message when it catches one thrown from its core eval(); this also gives programs the option to handle bad script files without killing the whole program.
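As a minimal sketch of how such an error type could be used, the block below defines EncodingError as a plain Error subclass and shows a caller recovering from a bad script. Only the name EncodingError comes from this document; the class body, the encoding property, the list of known encodings, and the stand-in runScript function are assumptions made for illustration.

```javascript
// Hypothetical shape for the error; only the name EncodingError is specified.
class EncodingError extends Error {
  constructor(message, encoding) {
    super(message);
    this.name = "EncodingError";
    this.encoding = encoding; // assumed extra field naming the offending encoding
  }
}

// Stand-in for executing a script whose declared encoding may be unsupported.
function runScript(declaredEncoding) {
  const known = ["US-ASCII", "UTF-8", "UTF-16LE", "UTF-16BE"]; // assumed list
  if (!known.includes(declaredEncoding.toUpperCase())) {
    throw new EncodingError("unknown encoding: " + declaredEncoding, declaredEncoding);
  }
  return "ok";
}

// A program can catch the error instead of letting the whole process die.
let outcome;
try {
  outcome = runScript("KLINGON-8");
} catch (e) {
  if (!(e instanceof EncodingError)) throw e;
  outcome = "skipped (" + e.message + ")";
}
```

Re-throwing anything that is not an EncodingError keeps unrelated failures visible to the caller.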
When executing a script file using exec() the interpreter should make a number of checks:
- If exec() was called without an encoding= param, check the first 3 bytes:
  - If the UTF-8 BOM ([0xEF, 0xBB, 0xBF]) is found at the start of the file, define the encoding as 'UTF-8'.
  - If the UTF-16 BOM (little-endian) ([0xFF, 0xFE]) is found at the start of the file, define the encoding as 'UTF-16LE'.
  - If the UTF-16 BOM (big-endian) ([0xFE, 0xFF]) is found at the start of the file, define the encoding as 'UTF-16BE'.
- If the encoding matches /^UTF-(8|16[LB]E)$/i, blank out any BOM found by replacing it with whitespace.
- If an encoding is already known, define HASH, SLASH, CR, and LF based on it; otherwise define HASH, SLASH, CR, and LF based on US-ASCII.
- Read the first two lines of the file, up to a maximum of 256 bytes. Two lines are used so that a shebang can be supported on the first line while the encoding is defined on the second.
  - If this many bytes pass without hitting a newline, the file is likely in some unknown encoding in which CR and LF do not match ASCII, and in that case we do not want to read through the entire file. It is also unlikely that a coding declaration would be found beyond this length, and without the limit a huge single-line minified file would be read in full just to do encoding checks.
- If either line starts with a HASH (#) or two SLASHes (//):
  - Check the line for a match to the regex /coding[=:]\s*([-_.A-Z0-9]+)/i (encoding checks and errors may be omitted if exec() was called with an encoding= param).
    - If a UTF-16 encoding was found, match based on it; otherwise match based on US-ASCII. Since this check is omitted for manually specified encodings, the only thing other than UTF-8 or ASCII that can be defined at this point is UTF-16, and UTF-8 makes no difference because the only characters that matter here are ASCII characters.
    - If a match is found, define the encoding as the string found.
      - If the encoding defined is unknown to the system, throw an EncodingError.
      - If an encoding was already found by previous tests and it does not match the declared encoding, throw an EncodingError.
  - If a HASH was matched:
    - If the line has no other characters, replace the HASH with whitespace.
    - Otherwise, replace the # and the character after it each with a / in the encoding found (assume US-ASCII's / if unknown) to stop the line from causing a syntax error.
- If the encoding is still unknown, use the value of monkeyscript.defaultEncoding, which is initialized inside monkeyscript.js with a default of 'UTF-8'.
- Strings should ideally be converted into UTF-16 when the script is compiled, so that compiled scripts (XDR) are already encoded and, most of the time, encoding checks are only done on the first run of the script.
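The detection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the interpreter's implementation: the names sniffBom, headerLines, detectEncoding, and ascii are hypothetical, EncodingError is reduced to a bare subclass, and the explicit encoding= param, the "unknown to the system" check, BOM blanking, and HASH rewriting are left out.

```javascript
// Hypothetical error type; the spec only fixes its name.
class EncodingError extends Error {}

const CODING_RE = /coding[=:]\s*([-_.A-Z0-9]+)/i;

// Returns the encoding implied by a leading BOM, or null when none is present.
function sniffBom(bytes) {
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) return "UTF-8";
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
  return null;
}

// Reads at most 256 bytes, skipping any BOM, and returns the first two lines.
// UTF-16 input is read as 2-byte code units; everything else byte by byte.
function headerLines(bytes, bomEncoding) {
  const limit = Math.min(bytes.length, 256);
  let text = "";
  if (bomEncoding === "UTF-16LE" || bomEncoding === "UTF-16BE") {
    const le = bomEncoding === "UTF-16LE";
    for (let i = 2; i + 1 < limit; i += 2) {
      const lo = bytes[le ? i : i + 1];
      const hi = bytes[le ? i + 1 : i];
      text += String.fromCharCode((hi << 8) | lo);
    }
  } else {
    for (let i = bomEncoding === "UTF-8" ? 3 : 0; i < limit; i++) {
      text += String.fromCharCode(bytes[i]);
    }
  }
  return text.split(/\r\n|\r|\n/).slice(0, 2);
}

// BOM first, then a coding comment on either header line, then the default.
function detectEncoding(bytes, defaultEncoding = "UTF-8") {
  let encoding = sniffBom(bytes);
  for (const line of headerLines(bytes, encoding)) {
    if (!line.startsWith("#") && !line.startsWith("//")) continue;
    const match = CODING_RE.exec(line);
    if (!match) continue;
    const declared = match[1].toUpperCase();
    if (encoding !== null && encoding !== declared) {
      throw new EncodingError("declared " + declared + " conflicts with " + encoding);
    }
    encoding = declared;
  }
  return encoding !== null ? encoding : defaultEncoding;
}

// Usage: a shebang on line one, the coding comment on line two.
const ascii = (s) => Uint8Array.from(s, (c) => c.charCodeAt(0));
const found = detectEncoding(
  ascii("#!/usr/bin/monkeyscript\n# -*- coding: ISO-8859-1 -*-\nprint(1);\n")
);
// found is "ISO-8859-1"; a file with no BOM and no coding comment yields "UTF-8".
```

Normalizing the declared name to upper case is a choice made here so that it can be compared directly with the BOM result; the rules above do not say how names are compared.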
This definition takes many factors and possibilities into account and gives flexibility in how things are handled.
- By default, programs are read in as UTF-8.
  - Because monkeyscript.defaultEncoding is defined by monkeyscript.js rather than fixed, there is room for alternate ways of determining the default encoding. monkeyscript could support an --encoding parameter to change it, or even something like --encoding=system to use the system's default as the default. (If monkeyscript.defaultEncoding proves subject to abuse, it might be an idea to make it read-only and let monkeyscript.js have the final say.)
- MonkeyScript will never allow JS to trip up on an included BOM, and will instead treat BOMs as helpers for determining the encoding.
- Because there are options outside of the text headers it's possible to support encodings which don't match ASCII
- You can also explicitly specify encodings when you run js files from your program
- As the recommended form, you may use a classic coding comment, which is understood by many text editors, to define the encoding:
# -*- coding: ISO-8859-1 -*-
- This also works with Vim's modeline syntax if you use Vim instead:
# vim: set fileencoding=ISO-8859-1 :
- Setting a quick encoding is easy, since the syntax is really just a simple match for coding followed by = or :
# coding=ISO-8859-1
- Shebang lines can still be made use of while setting encoding as well
#!/usr/bin/monkeyscript
# -*- coding: ISO-8859-1 -*-
- If you want to make sure that other interpreters don't choke on the # character, the use of actual // comments is supported:
// -*- coding: ISO-8859-1 -*-
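To check that each header form listed above is recognized, the following sketch runs the coding regex over them and also illustrates the HASH replacement rule. The names codingPattern, headers, declared, and neutralizeHash are hypothetical, and the rewrite works on already-decoded strings rather than on raw bytes in the file's encoding.

```javascript
// The pattern from the rules above, applied to each supported header style.
const codingPattern = /coding[=:]\s*([-_.A-Z0-9]+)/i;

const headers = [
  "# -*- coding: ISO-8859-1 -*-",         // Emacs style
  "# vim: set fileencoding=ISO-8859-1 :", // Vim modeline
  "# coding=ISO-8859-1",                  // short form
  "// -*- coding: ISO-8859-1 -*-",        // plain JS comment
];

// Every entry resolves to the same declared encoding, "ISO-8859-1".
const declared = headers.map((line) => {
  const match = codingPattern.exec(line);
  return match ? match[1] : null;
});

// Hypothetical helper for the HASH rule: a lone "#" becomes a space; otherwise
// "#" and the character after it both become "/" so the line parses as a comment.
function neutralizeHash(line) {
  if (!line.startsWith("#")) return line;
  return line.length === 1 ? " " : "//" + line.slice(2);
}

const rewritten = neutralizeHash("#!/usr/bin/monkeyscript");
// rewritten is "///usr/bin/monkeyscript", which JS reads as a // comment.
```

The Vim form matches because fileencoding= ends in coding=, which is why a single pattern covers all four styles.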
References
ServerJS: https://wiki.mozilla.org/ServerJS
ServerJS Character Sets: http://groups.google.com/group/serverjs/browse_thread/thread/7de2cba0637905b9
JS Interpreters and UTF8: http://groups.google.com/group/serverjs/browse_thread/thread/fb4f93dd6cb3a1ed