Commit ada345a

Correct code point explanation (elixir-lang#1554)
The example now has a grapheme made of multiple code points. The "ł" example was removed, as it's not clear how to compose it out of 2 code points. Closes elixir-lang#1553.
1 parent df4f027 commit ada345a

1 file changed, +6 -4 lines changed

getting-started/binaries-strings-and-char-lists.markdown

@@ -49,18 +49,20 @@ Elixir uses UTF-8 to encode its strings, which means that code points are encode
 
 Besides defining characters, UTF-8 also provides a notion of graphemes. Graphemes may consist of multiple characters that are often perceived as one. For example, `é` can be represented in Unicode as a single character. It can also be represented as the combination of the character `e` and the acute accent character `´` into a single grapheme.
 
-In other words, what we would expect to be a single character, such as `é` or `ł`, can in practice be multiple characters, each represented by potentially multiple bytes. Consider the following:
+In other words, what we would expect to be a single character, such as é, can in practice be multiple codepoints (in this case, e and an acute accent), each represented by potentially multiple bytes. Consider the following:
 
 ```elixir
-iex> string = "hełło"
-"hełło"
+iex> string = "héllo"
+"héllo"
 iex> String.length(string)
 5
+iex> length(String.to_charlist(string))
+6
 iex> byte_size(string)
 7
 ```
 
-`String.length/1` counts graphemes, but `byte_size/1` reveals the number of underlying raw bytes needed to store the string when using UTF-8 encoding. UTF-8 requires one byte to represent the characters `h`, `e`, and `o`, but two bytes to represent `ł`.
+`String.length/1` counts graphemes and returned 5. To count the number of code points, we can use `String.to_charlist/1` to convert a string to a list of codepoints, and then we get its length, which returned 6. Finally, `byte_size/1` reveals the number of underlying raw bytes needed to store the string when using UTF-8 encoding. UTF-8 requires one byte to represent the characters `h`, `e`, `l`, and `o`, but two bytes to represent the acute accent, adding to 7.
 
 > Note: if you are running on Windows, there is a chance your terminal does not use UTF-8 by default. You can change the encoding of your current session by running `chcp 65001` before entering `iex` (`iex.bat`).
 
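The counts 5, 6, and 7 in the added example hold only when the `é` in `"héllo"` is stored as `e` followed by a combining acute accent (U+0301). If an editor or terminal enters the precomposed `é` (U+00E9) instead, the same calls return different numbers. The sketch below is not part of the commit; it spells out both forms with `\u` escapes so the result does not depend on how the literal was typed. The variable names `decomposed` and `precomposed` are illustrative assumptions.

```elixir
# Decomposed form: "e" followed by the combining acute accent U+0301.
decomposed = "he\u0301llo"

# Precomposed form: the single code point U+00E9.
precomposed = "h\u00E9llo"

# Both forms are 5 graphemes.
IO.inspect(String.length(decomposed))              # 5
IO.inspect(String.length(precomposed))             # 5

# The code point counts differ: 6 versus 5.
IO.inspect(length(String.to_charlist(decomposed)))  # 6
IO.inspect(length(String.to_charlist(precomposed))) # 5

# The byte counts differ too: U+0301 and U+00E9 each take 2 bytes in UTF-8,
# but the decomposed form also keeps the 1-byte "e".
IO.inspect(byte_size(decomposed))                  # 7
IO.inspect(byte_size(precomposed))                 # 6

# NFC normalization composes the pair into the single code point.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed) # true
```

If the guide's example is pasted in precomposed form, readers would see 5 and 6 rather than 6 and 7, so writing the literal with an explicit escape may be the safer choice.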
