Ecma 262 by Karl Eklund

    v. Increment k by 2.
    vi. If the most significant bit in B is 0, then
        1. Let C be the character with code unit value B.
        2. If C is not in reservedSet, then
            a. Let S be the String containing only the character C.
        3. Else, C is in reservedSet
            a. Let S be the substring of string from position start to position k included.
    vii. Else, the most significant bit in B is 1
        1. Let n be the smallest non-negative number such that (B << n) & 0x80 is equal to 0.
        2. If n equals 1 or n is greater than 4, throw a URIError exception.
        3. Let Octets be an array of 8-bit integers of size n.
        4. Put B into Octets at position 0.
        5. If k + (3 × (n – 1)) is greater than or equal to strLen, throw a URIError exception.
        6. Let j be 1.
        7. Repeat, while j < n
            a. Increment k by 1.
            b. If the character at position k is not '%', throw a URIError exception.
            c. If the characters at position (k + 1) and (k + 2) within string do not represent hexadecimal digits, throw a URIError exception.
            d. Let B be the 8-bit value represented by the two hexadecimal digits at position (k + 1) and (k + 2).
            e. If the two most significant bits in B are not 10, throw a URIError exception.
            f. Increment k by 2.
            g. Put B into Octets at position j.
            h. Increment j by 1.
        8. Let V be the value obtained by applying the UTF-8 transformation to Octets, that is, from an array of octets into a 21-bit value. If Octets does not contain a valid UTF-8 encoding of a Unicode code point, throw a URIError exception.
        9. If V is less than 0x10000, then
            a. Let C be the character with code unit value V.
            b. If C is not in reservedSet, then
                i. Let S be the String containing only the character C.
            c. Else, C is in reservedSet
                i. Let S be the substring of string from position start to position k included.
        10. Else, V is ≥ 0x10000
            a. Let L be (((V – 0x10000) & 0x3FF) + 0xDC00).
            b. Let H be ((((V – 0x10000) >> 10) & 0x3FF) + 0xD800).
            c. Let S be the String containing the two characters with code unit values H and L.
e. Let R be a new String value computed by concatenating the previous value of R and S.
f. Increase k by 1.
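
The steps above can be sketched in JavaScript roughly as follows. This is an illustrative reading, not the normative algorithm. The names decodeSketch, leadingOnes, readEscapedOctet and utf8ToCodePoint are hypothetical, reservedSet is assumed to be a plain string of reserved characters, and the handling before step v (scanning for '%' and reading the first escape) is assumed, because it lies outside this excerpt. The validity check in utf8ToCodePoint (rejecting overlong forms, surrogates and values above 0x10FFFF) is one interpretation of "valid UTF-8 encoding" in step 8.

```javascript
// Step vii.1: smallest n >= 0 such that (B << n) & 0x80 is 0,
// which is the number of leading 1 bits in the octet B.
function leadingOnes(B) {
  let n = 0;
  while (((B << n) & 0x80) !== 0) n++;
  return n;
}

// Reads "%XX" at position k and returns the octet value XX.
function readEscapedOctet(string, k) {
  if (string[k] !== "%") throw new URIError("expected '%'");
  const hex = string.slice(k + 1, k + 3);
  if (!/^[0-9A-Fa-f]{2}$/.test(hex)) throw new URIError("bad hex digits");
  return parseInt(hex, 16);
}

// Step vii.8: turn n octets (n = 2..4) into a code point, rejecting
// overlong forms, surrogates and out-of-range values (an assumption).
function utf8ToCodePoint(octets) {
  const n = octets.length;
  let V = octets[0] & (0xFF >> (n + 1)); // payload bits of the lead octet
  for (let j = 1; j < n; j++) V = (V << 6) | (octets[j] & 0x3F);
  const minimum = [0, 0, 0x80, 0x800, 0x10000][n];
  if (V < minimum || V > 0x10FFFF || (V >= 0xD800 && V <= 0xDFFF)) {
    throw new URIError("invalid UTF-8 sequence");
  }
  return V;
}

function decodeSketch(string, reservedSet) {
  const strLen = string.length;
  let R = "";
  let k = 0;
  while (k < strLen) {
    let S;
    if (string[k] !== "%") {
      S = string[k]; // assumed: unescaped characters are copied through
    } else {
      const start = k;
      if (k + 2 >= strLen) throw new URIError("truncated escape");
      let B = readEscapedOctet(string, k);
      k += 2; // step v
      if ((B & 0x80) === 0) { // step vi: single-octet character
        const C = String.fromCharCode(B);
        S = reservedSet.includes(C) ? string.slice(start, k + 1) : C;
      } else { // step vii: multi-octet sequence
        const n = leadingOnes(B);
        if (n === 1 || n > 4) throw new URIError("bad lead octet");
        const octets = [B];
        if (k + 3 * (n - 1) >= strLen) throw new URIError("truncated sequence");
        for (let j = 1; j < n; j++) {
          k += 1;
          B = readEscapedOctet(string, k);
          if ((B & 0xC0) !== 0x80) throw new URIError("bad continuation octet");
          k += 2;
          octets.push(B);
        }
        const V = utf8ToCodePoint(octets);
        if (V < 0x10000) { // step 9
          const C = String.fromCharCode(V);
          S = reservedSet.includes(C) ? string.slice(start, k + 1) : C;
        } else { // step 10: split into a surrogate pair
          const L = ((V - 0x10000) & 0x3FF) + 0xDC00;
          const H = (((V - 0x10000) >> 10) & 0x3FF) + 0xD800;
          S = String.fromCharCode(H, L);
        }
      }
    }
    R += S; // step e
    k += 1; // step f
  }
  return R;
}
```

With an empty reservedSet, decodeSketch("a%2Fb", "") yields "a/b". With reservedSet "/", the escape is left untouched and the result is "a%2Fb", which is the difference steps vi.2 and vi.3 describe.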

NOTE This syntax of Uniform Resource Identifiers is based upon RFC 2396 and does not reflect the more recent RFC 3986, which replaces RFC 2396. A formal description and implementation of UTF-8 is given in RFC 3629.

In UTF-8, characters are encoded using sequences of 1 to 6 octets (RFC 3629 restricts valid sequences to at most 4 octets, consistent with the algorithm above rejecting n greater than 4). The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n > 1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bits of that octet contain bits from the value of the character to be encoded. The following octets all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded. The possible UTF-8 encodings of ECMAScript characters are specified in Table 21.
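
The bit layout described here can be illustrated with a small JavaScript sketch that goes in the opposite direction, from a code point to its octets. The function names utf8Octets and percentEscape are hypothetical. The sketch assumes the 1 to 4 octet range of RFC 3629 and rejects surrogate code points, so it is not a transcription of Table 21.

```javascript
// Encode one Unicode code point V as UTF-8 octets.
// Lead octet: n leading 1 bits, then a 0 bit, then payload bits.
// Continuation octets: bits 10, then 6 payload bits.
function utf8Octets(V) {
  if (V < 0 || V > 0x10FFFF || (V >= 0xD800 && V <= 0xDFFF)) {
    throw new RangeError("not an encodable code point");
  }
  if (V < 0x80) {
    return [V]; // 0xxxxxxx
  }
  if (V < 0x800) {
    return [0xC0 | (V >> 6), 0x80 | (V & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (V < 0x10000) {
    return [
      0xE0 | (V >> 12), // 1110xxxx
      0x80 | ((V >> 6) & 0x3F), // 10xxxxxx
      0x80 | (V & 0x3F), // 10xxxxxx
    ];
  }
  return [
    0xF0 | (V >> 18), // 11110xxx
    0x80 | ((V >> 12) & 0x3F),
    0x80 | ((V >> 6) & 0x3F),
    0x80 | (V & 0x3F),
  ];
}

// Render octets as a %XX escape sequence.
function percentEscape(octets) {
  return octets
    .map((o) => "%" + o.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}
```

For example, percentEscape(utf8Octets(0xE9)) gives "%C3%A9": the lead octet 0xC3 is 11000011 (two leading 1 bits for a two-octet sequence) and the continuation octet 0xA9 is 10101001.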
