Mention string iteration goes by Unicode characters (mdn#18504)
* Mention string iteration goes by Unicode characters

* add a flag example
Josh-Cena authored Jul 19, 2022
1 parent 5f210be commit c2d5d7c
Showing 3 changed files with 72 additions and 45 deletions.
@@ -30,6 +30,23 @@ str[Symbol.iterator]

A new iterator object.

## Description

A string is [iterable](/en-US/docs/Web/JavaScript/Reference/Iteration_protocols) because it implements the `@@iterator` method. This means strings can be used in [`for...of`](/en-US/docs/Web/JavaScript/Reference/Statements/for...of) loops, [spread](/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax) into arrays, and so on.

Strings are iterated by Unicode codepoints. This means grapheme clusters will be split, but surrogate pairs will be preserved.

```js
// "Backhand Index Pointing Right: Dark Skin Tone"
[..."👉🏿"]; // ['👉', '🏿']
// splits into the basic "Backhand Index Pointing Right" emoji and
// the "Dark skin tone" emoji

// "Family: Man, Boy"
[..."👨‍👦"]; // [ '👨', '‍', '👦' ]
// splits into the "Man" and "Boy" emoji, joined by a ZWJ
```

## Examples

### Using \[@@iterator]\()
71 changes: 46 additions & 25 deletions files/en-us/web/javascript/reference/global_objects/string/index.md
@@ -72,8 +72,8 @@ In C, the `strcmp()` function is used for comparing strings. In JavaScript,
you just use the [less-than and greater-than operators](/en-US/docs/Web/JavaScript/Reference/Operators):

```js
let a = 'a'
let b = 'b'
const a = 'a';
const b = 'b';
if (a < b) { // true
console.log(a + ' is less than ' + b)
} else if (a > b) {
@@ -92,10 +92,9 @@ to compare without regard to upper or lower case characters, use a function simi
this:

```js
function isEqual(str1, str2)
{
return str1.toUpperCase() === str2.toUpperCase()
} // isEqual
function isEqual(str1, str2) {
return str1.toUpperCase() === str2.toUpperCase();
}
```
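As a small illustration (not part of the commit), the helper above can be exercised as follows. The sample strings are arbitrary, and the `ß` case is included only as one well-known example of why upper-casing and lower-casing can disagree.

```js
// Same helper as in the hunk above, repeated so the sketch is self-contained.
function isEqual(str1, str2) {
  return str1.toUpperCase() === str2.toUpperCase();
}

console.log(isEqual("Mozilla", "MOZILLA")); // true

// "ß" upper-cases to "SS", so the upper-case comparison matches...
console.log(isEqual("straße", "STRASSE")); // true

// ...whereas a lower-case comparison of the same inputs would not.
console.log("straße".toLowerCase() === "STRASSE".toLowerCase()); // false
```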

Upper case is used instead of lower case in this function, due to problems with certain
@@ -115,8 +114,8 @@ will automatically wrap the string primitive and call the method or perform the
lookup on the wrapper object instead.

```js
let s_prim = 'foo'
let s_obj = new String(s_prim)
const s_prim = 'foo'
const s_obj = new String(s_prim)

console.log(typeof s_prim) // Logs "string"
console.log(typeof s_obj) // Logs "object"
@@ -130,10 +129,10 @@ using {{jsxref("Global_Objects/eval", "eval()")}}. Primitives passed to
all other objects are, by returning the object. For example:

```js
let s1 = '2 + 2' // creates a string primitive
let s2 = new String('2 + 2') // creates a String object
console.log(eval(s1)) // returns the number 4
console.log(eval(s2)) // returns the string "2 + 2"
const s1 = '2 + 2'; // creates a string primitive
const s2 = new String('2 + 2'); // creates a String object
console.log(eval(s1)); // returns the number 4
console.log(eval(s2)); // returns the string "2 + 2"
```

For these reasons, the code may break when it encounters `String` objects
@@ -172,36 +171,58 @@ Special characters can be encoded using escape sequences:
Sometimes, your code will include strings which are very long. Rather than having lines
that go on endlessly, or wrap at the whim of your editor, you may wish to specifically
break the string into multiple lines in the source code without affecting the actual
string contents. There are two ways you can do this.

#### Method 1
string contents.

You can use the [+](/en-US/docs/Web/JavaScript/Reference/Operators/Addition)
You can use the [`+`](/en-US/docs/Web/JavaScript/Reference/Operators/Addition)
operator to append multiple strings together, like this:

```js
let longString = "This is a very long string which needs " +
"to wrap across multiple lines because " +
"otherwise my code is unreadable."
const longString = "This is a very long string which needs " +
"to wrap across multiple lines because " +
"otherwise my code is unreadable."
```

#### Method 2

You can use the backslash character (`\`) at the end of each line to
Or you can use the backslash character (`\`) at the end of each line to
indicate that the string will continue on the next line. Make sure there is no space or
any other character after the backslash (except for a line break), or as an indent;
otherwise it will not work.

That form looks like this:

```js
let longString = "This is a very long string which needs \
const longString = "This is a very long string which needs \
to wrap across multiple lines because \
otherwise my code is unreadable."
```

Both of the above methods result in identical strings.

### UTF-16 characters, Unicode codepoints, and grapheme clusters

Strings are represented fundamentally as sequences of [UTF-16 code units](https://en.wikipedia.org/wiki/UTF-16). In UTF-16 encoding, every code unit is exactly 16 bits long. This means there are a maximum of 2<sup>16</sup>, or 65536, possible characters representable as single UTF-16 code units. This character set is called the [basic multilingual plane (BMP)](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane), and includes the most common characters, such as the Latin, Greek, and Cyrillic alphabets, as well as many East Asian characters. Each code unit can be written in a string with `\u` followed by exactly four hex digits.

However, the entire Unicode character set is much, much bigger than 65536. The extra characters are stored in UTF-16 as _surrogate pairs_, which are pairs of 16-bit code units that represent a single character. To avoid ambiguity, the two parts of the pair must be between `0xD800` and `0xDFFF`, and these code units are not used to encode single-code-unit characters. Therefore, "lone surrogates" are often not valid values for string manipulation; for example, [`encodeURI()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI) will throw a {{jsxref("URIError")}} for lone surrogates. Each Unicode character, composed of one or two UTF-16 code units, is also called a _Unicode codepoint_. Each Unicode codepoint can be written in a string with `\u{xxxxxx}` where `xxxxxx` represents 1–6 hex digits.
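The relationship between the two escape forms can be sketched as follows. This is an illustrative example rather than text from the commit; the helper name `encodeOrReport` is hypothetical and exists only to show the `URIError` without aborting the script.

```js
// The same character written as one code point escape and as a surrogate pair.
const smiley = "\u{1F604}";
const asPair = "\uD83D\uDE04";

console.log(smiley === asPair); // true
console.log(smiley.length); // 2 (two UTF-16 code units)
console.log(smiley.codePointAt(0).toString(16)); // "1f604"

// Hypothetical helper: returns the encoded URI, or the error name on failure.
function encodeOrReport(str) {
  try {
    return encodeURI(str);
  } catch (err) {
    return err.name;
  }
}

console.log(encodeOrReport(smiley)); // "%F0%9F%98%84"
console.log(encodeOrReport("\uD83D")); // "URIError" (lone surrogate)
```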

On top of Unicode characters, there are certain sequences of Unicode characters that should be treated as one visual unit, known as a _grapheme cluster_. The most common case is emojis: many emojis that have a range of variations are actually formed by multiple emojis, usually joined by the \<ZWJ> (`U+200D`) character.

You must be careful which level of characters you are iterating on. For example, [`split("")`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split) will split by UTF-16 code units and will separate surrogate pairs. String indexes also refer to the index of each UTF-16 code unit. On the other hand, [`@@iterator()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/@@iterator) iterates by Unicode codepoints. Iterating through grapheme clusters will require some custom code.

```js
"😄".split(""); // ['\ud83d', '\ude04']; splits into two lone surrogates

// "Backhand Index Pointing Right: Dark Skin Tone"
[..."👉🏿"]; // ['👉', '🏿']
// splits into the basic "Backhand Index Pointing Right" emoji and
// the "Dark skin tone" emoji

// "Family: Man, Boy"
[..."👨‍👦"]; // [ '👨', '‍', '👦' ]
// splits into the "Man" and "Boy" emoji, joined by a ZWJ

// The United Nations flag
[..."🇺🇳"]; // [ '🇺', '🇳' ]
// splits into two "regional indicator" letters "U" and "N".
// Country and region flag emojis like this one are formed by joining two regional indicator letters
```
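One possible way to iterate by grapheme clusters, offered here as a hedged sketch and not as part of the commit, is `Intl.Segmenter` with `granularity: "grapheme"`. This assumes a runtime that ships `Intl.Segmenter`; the helper name `splitGraphemes` and the default `"en"` locale are illustrative choices, and the exact segmentation depends on the Unicode data bundled with the engine.

```js
// Hypothetical helper: split a string into grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function splitGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // Each item yielded by segment() has a `segment` property holding the text.
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// "Family: Man, Boy" written with escapes: man + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F466}";

console.log([...family].length); // 3 code points
console.log(splitGraphemes(family).length); // 1 grapheme cluster

// "e" followed by a combining acute accent
console.log(splitGraphemes("e\u0301").length); // 1
```

Runtimes without `Intl.Segmenter` would need a library or hand-written segmentation instead.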

## Constructor

- {{jsxref("String/String", "String()")}}
@@ -18,45 +18,33 @@ The **`length`** property of a {{jsxref("String")}} object contains the length o

## Description

This property returns the number of code units in the string. {{interwiki("wikipedia", "UTF-16")}}, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by `length` to not match the actual number of characters in the string.
This property returns the number of code units in the string. {{interwiki("wikipedia", "UTF-16")}}, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by `length` to not match the actual number of Unicode characters in the string.

ECMAScript 2016 (ed. 7) established a maximum length of `2^53 - 1` elements. Previously, no maximum length was specified. In Firefox, strings have a maximum length of `2**30 - 2` (\~1GB). In versions prior to Firefox 65, the maximum length was `2**28 - 1` (\~256MB).

For an empty string, `length` is 0.

The static property `String.length` is unrelated to the length of strings, it's the arity of the `String` function (loosely, the number of formal parameters it has), which is 1.
The static property `String.length` is unrelated to the length of strings. It's the [arity](/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/length) of the `String` function (loosely, the number of formal parameters it has), which is 1.

## Unicode

Since \`length\` counts code units instead of characters, if you want to get the number of characters you need something like this:
Since `length` counts code units instead of characters, if you want to get the number of characters you need something like this:

```js
function getCharacterLength (str) {
function getCharacterLength(str) {
// The string iterator that is used here iterates over characters,
// not mere code units
return [...str].length;
}

console.log(getCharacterLength('A\uD87E\uDC04Z')); // 3

// While not recommended, you could add this to each string as follows:

Object.defineProperty(String.prototype, 'charLength', {
get () {
return getCharacterLength(this);
}
});

console.log('A\uD87E\uDC04Z'.charLength); // 3
```
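A variation on the helper above, sketched here as an assumption-labeled alternative and not taken from the commit, counts code points with a `for...of` loop and so avoids building an intermediate array. The name `countCodePoints` is hypothetical.

```js
// Hypothetical alternative: count code points without allocating an array.
function countCodePoints(str) {
  let count = 0;
  // The string iterator yields one element per Unicode code point.
  for (const _ of str) {
    count++;
  }
  return count;
}

const sample = "A\uD87E\uDC04Z";
console.log(sample.length); // 4 code units
console.log(countCodePoints(sample)); // 3 code points
```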

## Examples

### Basic usage

```js
let x = 'Mozilla';
let empty = '';
const x = 'Mozilla';
const empty = '';

console.log(x + ' is ' + x.length + ' code units long');
/* "Mozilla is 7 code units long" */
@@ -67,10 +55,11 @@ console.log('The empty string has a length of ' + empty.length);

### Assigning to length

Because a string is a primitive, attempting to assign a value to a string's `length` property has no observable effect in non-strict code, and throws a `TypeError` in [strict mode](/en-US/docs/Web/JavaScript/Reference/Strict_mode).
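The strict-mode behavior can be sketched as follows. This is an illustration added for clarity, not code from the commit; the function name `tryAssignLength` is hypothetical, and the function-level `"use strict"` directive is used so the sketch behaves the same whether or not the surrounding script is strict.

```js
// Hypothetical helper: reports what happens when assigning to .length in strict mode.
function tryAssignLength(str) {
  "use strict";
  try {
    str.length = 4; // length is read-only, so strict mode throws here
    return "no error";
  } catch (err) {
    return err.name;
  }
}

console.log(tryAssignLength("bluebells")); // "TypeError"
```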

```js
let myString = "bluebells";
const myString = "bluebells";

// Attempting to assign a value to a string's .length property has no observable effect.
myString.length = 4;
console.log(myString);
// expected output: "bluebells"