Skip to content

Commit

Permalink
Update to latest confusables.txt
Browse files Browse the repository at this point in the history
This change simply updates the package to be based on the latest of
Unicode's `confusables.txt`.

Additionally, updated the README and Makefile to allow new contributors
to easily perform future updates.

NW
  • Loading branch information
woodgern committed Mar 24, 2021
1 parent a95a908 commit 19acf7e
Show file tree
Hide file tree
Showing 5 changed files with 49 additions and 11 deletions.
7 changes: 5 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,11 @@ parse:
python3 confusables/parse.py

update:
wget -O confusables/assets/confusables.txt https://www.unicode.org/Public/security/12.1.0/confusables.txt

if [ -z "$(VERSION)" ]; then \
echo "Please specify the VERSION argument with a valid version of unicode confusables"; \
else \
wget -O confusables/assets/confusables.txt https://www.unicode.org/Public/security/${VERSION}/confusables.txt; \
fi
test:
python3 -m unittest discover

Expand Down
15 changes: 15 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,21 @@ print(normalize('Ʀỏ𝕍3ℛ', prioritize_alpha=False))
# prints ['r0v3r', 'r0ver', 'ro\'v3r', 'ro\'ver', 'rov3r', 'rover']
```

## Updating to the latest Unicode confusables version

If you find the latest version of this package to have an out of date version of the unicode official `confusables.txt`, then why not submit a PR to update it!

First, find out what the latest version of unicode confusables is. Then, run
```
make update VERSION=X.Y.Z
```

Next, run
```
make parse
```
And that's it! Commit your changes and create a pull request.

## About confusables

This module is something I put together because I'm interested in the field of language processing. I'm hoping to build out it's functionality, and I'm more than happy to take suggestions!
Expand Down
2 changes: 1 addition & 1 deletion confusables/assets/confusable_mapping.json

Large diffs are not rendered by default.

34 changes: 27 additions & 7 deletions confusables/assets/confusables.txt
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# confusables.txt
# Date: 2019-04-01, 21:59:19 GMT
# © 2019 Unicode®, Inc.
# Date: 2020-02-13, 01:38:49 GMT
# © 2020 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Security Mechanisms for UTS #39
# Version: 12.1.0
# Version: 13.0.0
#
# For documentation and usage, see http://www.unicode.org/reports/tr39
#
Expand Down Expand Up @@ -1358,6 +1358,10 @@ FFED ; 25AA ; MA #* ( ■ → ▪ ) HALFWIDTH BLACK SQUARE → BLACK SMALL SQUAR

266A ; 1D158 1D165 1D16E ; MA #* ( ♪ → 𝅘𝅥𝅮 ) EIGHTH NOTE → MUSICAL SYMBOL NOTEHEAD BLACK, MUSICAL SYMBOL COMBINING STEM, MUSICAL SYMBOL COMBINING FLAG-1 #

24EA ; 1F10D ; MA #* ( ⓪ → 🄍 ) CIRCLED DIGIT ZERO → CIRCLED ZERO WITH SLASH #

21BA ; 1F10E ; MA #* ( ↺ → 🄎 ) ANTICLOCKWISE OPEN CIRCLE ARROW → CIRCLED ANTICLOCKWISE ARROW #

02D9 ; 0971 ; MA #* ( ˙ → ॱ ) DOT ABOVE → DEVANAGARI SIGN HIGH SPACING DOT #
0D4E ; 0971 ; MA # ( ൎ → ॱ ) MALAYALAM LETTER DOT REPH → DEVANAGARI SIGN HIGH SPACING DOT # →˙→

Expand Down Expand Up @@ -1390,6 +1394,8 @@ D7BC ; 30FC 1169 ; MA # ( ힼ → ーᅩ ) HANGUL JUNGSEONG EU-O → KATAKANA-HI

1197 ; 30FC 4E28 116E ; MA # ( ᆗ → ー丨ᅮ ) HANGUL JUNGSEONG YI-U → KATAKANA-HIRAGANA PROLONGED SOUND MARK, CJK UNIFIED IDEOGRAPH-4E28, HANGUL JUNGSEONG U # →ᅳᅵᅮ→

1F10F ; 0024 20E0 ; MA #* ( 🄏 → $⃠ ) CIRCLED DOLLAR SIGN WITH OVERLAID BACKSLASH → DOLLAR SIGN, COMBINING ENCLOSING CIRCLE BACKSLASH #

20A4 ; 00A3 ; MA #* ( ₤ → £ ) LIRA SIGN → POUND SIGN #

3012 ; 20B8 ; MA #* ( 〒 → ₸ ) POSTAL MARK → TENGE SIGN #
Expand All @@ -1416,6 +1422,7 @@ A9C6 ; A9D0 ; MA #* ( ꧆ → ꧐ ) JAVANESE PADA WINDU → JAVANESE DIGIT ZERO
1D7E4 ; 0032 ; MA # ( 𝟤 → 2 ) MATHEMATICAL SANS-SERIF DIGIT TWO → DIGIT TWO #
1D7EE ; 0032 ; MA # ( 𝟮 → 2 ) MATHEMATICAL SANS-SERIF BOLD DIGIT TWO → DIGIT TWO #
1D7F8 ; 0032 ; MA # ( 𝟸 → 2 ) MATHEMATICAL MONOSPACE DIGIT TWO → DIGIT TWO #
1FBF2 ; 0032 ; MA # ( 🯲 → 2 ) SEGMENTED DIGIT TWO → DIGIT TWO #
A75A ; 0032 ; MA # ( Ꝛ → 2 ) LATIN CAPITAL LETTER R ROTUNDA → DIGIT TWO #
01A7 ; 0032 ; MA # ( Ƨ → 2 ) LATIN CAPITAL LETTER TONE TWO → DIGIT TWO #
03E8 ; 0032 ; MA # ( Ϩ → 2 ) COPTIC CAPITAL LETTER HORI → DIGIT TWO # →Ƨ→
Expand Down Expand Up @@ -1488,6 +1495,7 @@ A9CF ; 0662 ; MA # ( ꧏ → ‎٢‎ ) JAVANESE PANGRANGKEP → ARABIC-INDIC DI
1D7E5 ; 0033 ; MA # ( 𝟥 → 3 ) MATHEMATICAL SANS-SERIF DIGIT THREE → DIGIT THREE #
1D7EF ; 0033 ; MA # ( 𝟯 → 3 ) MATHEMATICAL SANS-SERIF BOLD DIGIT THREE → DIGIT THREE #
1D7F9 ; 0033 ; MA # ( 𝟹 → 3 ) MATHEMATICAL MONOSPACE DIGIT THREE → DIGIT THREE #
1FBF3 ; 0033 ; MA # ( 🯳 → 3 ) SEGMENTED DIGIT THREE → DIGIT THREE #
A7AB ; 0033 ; MA # ( Ɜ → 3 ) LATIN CAPITAL LETTER REVERSED OPEN E → DIGIT THREE #
021C ; 0033 ; MA # ( Ȝ → 3 ) LATIN CAPITAL LETTER YOGH → DIGIT THREE # →Ʒ→
01B7 ; 0033 ; MA # ( Ʒ → 3 ) LATIN CAPITAL LETTER EZH → DIGIT THREE #
Expand Down Expand Up @@ -1526,6 +1534,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7E6 ; 0034 ; MA # ( 𝟦 → 4 ) MATHEMATICAL SANS-SERIF DIGIT FOUR → DIGIT FOUR #
1D7F0 ; 0034 ; MA # ( 𝟰 → 4 ) MATHEMATICAL SANS-SERIF BOLD DIGIT FOUR → DIGIT FOUR #
1D7FA ; 0034 ; MA # ( 𝟺 → 4 ) MATHEMATICAL MONOSPACE DIGIT FOUR → DIGIT FOUR #
1FBF4 ; 0034 ; MA # ( 🯴 → 4 ) SEGMENTED DIGIT FOUR → DIGIT FOUR #
13CE ; 0034 ; MA # ( Ꮞ → 4 ) CHEROKEE LETTER SE → DIGIT FOUR #
118AF ; 0034 ; MA # ( 𑢯 → 4 ) WARANG CITI CAPITAL LETTER UC → DIGIT FOUR #

Expand All @@ -1552,6 +1561,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7E7 ; 0035 ; MA # ( 𝟧 → 5 ) MATHEMATICAL SANS-SERIF DIGIT FIVE → DIGIT FIVE #
1D7F1 ; 0035 ; MA # ( 𝟱 → 5 ) MATHEMATICAL SANS-SERIF BOLD DIGIT FIVE → DIGIT FIVE #
1D7FB ; 0035 ; MA # ( 𝟻 → 5 ) MATHEMATICAL MONOSPACE DIGIT FIVE → DIGIT FIVE #
1FBF5 ; 0035 ; MA # ( 🯵 → 5 ) SEGMENTED DIGIT FIVE → DIGIT FIVE #
01BC ; 0035 ; MA # ( Ƽ → 5 ) LATIN CAPITAL LETTER TONE FIVE → DIGIT FIVE #
118BB ; 0035 ; MA # ( 𑢻 → 5 ) WARANG CITI CAPITAL LETTER HORR → DIGIT FIVE #

Expand All @@ -1572,6 +1582,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7E8 ; 0036 ; MA # ( 𝟨 → 6 ) MATHEMATICAL SANS-SERIF DIGIT SIX → DIGIT SIX #
1D7F2 ; 0036 ; MA # ( 𝟲 → 6 ) MATHEMATICAL SANS-SERIF BOLD DIGIT SIX → DIGIT SIX #
1D7FC ; 0036 ; MA # ( 𝟼 → 6 ) MATHEMATICAL MONOSPACE DIGIT SIX → DIGIT SIX #
1FBF6 ; 0036 ; MA # ( 🯶 → 6 ) SEGMENTED DIGIT SIX → DIGIT SIX #
2CD2 ; 0036 ; MA # ( Ⳓ → 6 ) COPTIC CAPITAL LETTER OLD COPTIC HEI → DIGIT SIX #
0431 ; 0036 ; MA # ( б → 6 ) CYRILLIC SMALL LETTER BE → DIGIT SIX #
13EE ; 0036 ; MA # ( Ꮾ → 6 ) CHEROKEE LETTER WV → DIGIT SIX #
Expand Down Expand Up @@ -1599,6 +1610,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7E9 ; 0037 ; MA # ( 𝟩 → 7 ) MATHEMATICAL SANS-SERIF DIGIT SEVEN → DIGIT SEVEN #
1D7F3 ; 0037 ; MA # ( 𝟳 → 7 ) MATHEMATICAL SANS-SERIF BOLD DIGIT SEVEN → DIGIT SEVEN #
1D7FD ; 0037 ; MA # ( 𝟽 → 7 ) MATHEMATICAL MONOSPACE DIGIT SEVEN → DIGIT SEVEN #
1FBF7 ; 0037 ; MA # ( 🯷 → 7 ) SEGMENTED DIGIT SEVEN → DIGIT SEVEN #
104D2 ; 0037 ; MA # ( 𐓒 → 7 ) OSAGE CAPITAL LETTER ZA → DIGIT SEVEN #
118C6 ; 0037 ; MA # ( 𑣆 → 7 ) WARANG CITI SMALL LETTER II → DIGIT SEVEN #

Expand All @@ -1623,6 +1635,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7EA ; 0038 ; MA # ( 𝟪 → 8 ) MATHEMATICAL SANS-SERIF DIGIT EIGHT → DIGIT EIGHT #
1D7F4 ; 0038 ; MA # ( 𝟴 → 8 ) MATHEMATICAL SANS-SERIF BOLD DIGIT EIGHT → DIGIT EIGHT #
1D7FE ; 0038 ; MA # ( 𝟾 → 8 ) MATHEMATICAL MONOSPACE DIGIT EIGHT → DIGIT EIGHT #
1FBF8 ; 0038 ; MA # ( 🯸 → 8 ) SEGMENTED DIGIT EIGHT → DIGIT EIGHT #
0223 ; 0038 ; MA # ( ȣ → 8 ) LATIN SMALL LETTER OU → DIGIT EIGHT #
0222 ; 0038 ; MA # ( Ȣ → 8 ) LATIN CAPITAL LETTER OU → DIGIT EIGHT #
1031A ; 0038 ; MA # ( 𐌚 → 8 ) OLD ITALIC LETTER EF → DIGIT EIGHT #
Expand Down Expand Up @@ -1650,6 +1663,7 @@ A76A ; 0033 ; MA # ( Ꝫ → 3 ) LATIN CAPITAL LETTER ET → DIGIT THREE #
1D7EB ; 0039 ; MA # ( 𝟫 → 9 ) MATHEMATICAL SANS-SERIF DIGIT NINE → DIGIT NINE #
1D7F5 ; 0039 ; MA # ( 𝟵 → 9 ) MATHEMATICAL SANS-SERIF BOLD DIGIT NINE → DIGIT NINE #
1D7FF ; 0039 ; MA # ( 𝟿 → 9 ) MATHEMATICAL MONOSPACE DIGIT NINE → DIGIT NINE #
1FBF9 ; 0039 ; MA # ( 🯹 → 9 ) SEGMENTED DIGIT NINE → DIGIT NINE #
A76E ; 0039 ; MA # ( Ꝯ → 9 ) LATIN CAPITAL LETTER CON → DIGIT NINE #
2CCA ; 0039 ; MA # ( Ⳋ → 9 ) COPTIC CAPITAL LETTER DIALECT-P HORI → DIGIT NINE #
118CC ; 0039 ; MA # ( 𑣌 → 9 ) WARANG CITI SMALL LETTER KO → DIGIT NINE #
Expand Down Expand Up @@ -1912,6 +1926,8 @@ A4DA ; 0043 ; MA # ( ꓚ → C ) LISU LETTER CA → LATIN CAPITAL LETTER C #

20A1 ; 0043 20EB ; MA #* ( ₡ → C⃫ ) COLON SIGN → LATIN CAPITAL LETTER C, COMBINING LONG DOUBLE SOLIDUS OVERLAY #

1F16E ; 0043 20E0 ; MA #* ( 🅮 → C⃠ ) CIRCLED C WITH OVERLAID BACKSLASH → LATIN CAPITAL LETTER C, COMBINING ENCLOSING CIRCLE BACKSLASH #

00E7 ; 0063 0326 ; MA # ( ç → c̦ ) LATIN SMALL LETTER C WITH CEDILLA → LATIN SMALL LETTER C, COMBINING COMMA BELOW # →ҫ→→с̡→
04AB ; 0063 0326 ; MA # ( ҫ → c̦ ) CYRILLIC SMALL LETTER ES WITH DESCENDER → LATIN SMALL LETTER C, COMBINING COMMA BELOW # →с̡→

Expand All @@ -1924,6 +1940,8 @@ A4DA ; 0043 ; MA # ( ꓚ → C ) LISU LETTER CA → LATIN CAPITAL LETTER C #

2106 ; 0063 002F 0075 ; MA #* ( ℆ → c/u ) CADA UNA → LATIN SMALL LETTER C, SOLIDUS, LATIN SMALL LETTER U #

1F16D ; 33C4 0009 20DD ; MA #* ( 🅭 → ) CIRCLED CC → SQUARE CC, <CHARACTER TABULATION>, COMBINING ENCLOSING CIRCLE #

22F4 ; A793 ; MA #* ( ⋴ → ꞓ ) SMALL ELEMENT OF WITH VERTICAL BAR AT END OF HORIZONTAL STROKE → LATIN SMALL LETTER C WITH BAR # →ɛ→→є→
025B ; A793 ; MA # ( ɛ → ꞓ ) LATIN SMALL LETTER OPEN E → LATIN SMALL LETTER C WITH BAR # →є→
03B5 ; A793 ; MA # ( ε → ꞓ ) GREEK SMALL LETTER EPSILON → LATIN SMALL LETTER C WITH BAR # →є→
Expand Down Expand Up @@ -2353,7 +2371,7 @@ A6B1 ; 2C75 ; MA # ( ꚱ → Ⱶ ) BAMUM LETTER NDAA → LATIN CAPITAL LETTER HA
A795 ; A727 ; MA # ( ꞕ → ꜧ ) LATIN SMALL LETTER H WITH PALATAL HOOK → LATIN SMALL LETTER HENG #

02DB ; 0069 ; MA #* ( ˛ → i ) OGONEK → LATIN SMALL LETTER I # →ͺ→→ι→→ι→
2373 ; 0069 ; MA #* ( ⍳ → i ) APL FUNCTIONAL SYMBOL IOTA → LATIN SMALL LETTER I # →ι
2373 ; 0069 ; MA #* ( ⍳ → i ) APL FUNCTIONAL SYMBOL IOTA → LATIN SMALL LETTER I # →ɩ
FF49 ; 0069 ; MA # ( i → i ) FULLWIDTH LATIN SMALL LETTER I → LATIN SMALL LETTER I # →і→
2170 ; 0069 ; MA # ( ⅰ → i ) SMALL ROMAN NUMERAL ONE → LATIN SMALL LETTER I #
2139 ; 0069 ; MA # ( ℹ → i ) INFORMATION SOURCE → LATIN SMALL LETTER I #
Expand Down Expand Up @@ -2530,6 +2548,7 @@ FFE8 ; 006C ; MA #* ( │ → l ) HALFWIDTH FORMS LIGHT VERTICAL → LATIN SMALL
1D7E3 ; 006C ; MA # ( 𝟣 → l ) MATHEMATICAL SANS-SERIF DIGIT ONE → LATIN SMALL LETTER L # →1→
1D7ED ; 006C ; MA # ( 𝟭 → l ) MATHEMATICAL SANS-SERIF BOLD DIGIT ONE → LATIN SMALL LETTER L # →1→
1D7F7 ; 006C ; MA # ( 𝟷 → l ) MATHEMATICAL MONOSPACE DIGIT ONE → LATIN SMALL LETTER L # →1→
1FBF1 ; 006C ; MA # ( 🯱 → l ) SEGMENTED DIGIT ONE → LATIN SMALL LETTER L # →1→
0049 ; 006C ; MA # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L #
FF29 ; 006C ; MA # ( I → l ) FULLWIDTH LATIN CAPITAL LETTER I → LATIN SMALL LETTER L # →Ӏ→
2160 ; 006C ; MA # ( Ⅰ → l ) ROMAN NUMERAL ONE → LATIN SMALL LETTER L # →Ӏ→
Expand Down Expand Up @@ -2957,6 +2976,7 @@ FBA6 ; 006F ; MA # ( ‎ﮦ‎ → o ) ARABIC LETTER HEH GOAL ISOLATED FORM →
1D7E2 ; 004F ; MA # ( 𝟢 → O ) MATHEMATICAL SANS-SERIF DIGIT ZERO → LATIN CAPITAL LETTER O # →0→
1D7EC ; 004F ; MA # ( 𝟬 → O ) MATHEMATICAL SANS-SERIF BOLD DIGIT ZERO → LATIN CAPITAL LETTER O # →0→
1D7F6 ; 004F ; MA # ( 𝟶 → O ) MATHEMATICAL MONOSPACE DIGIT ZERO → LATIN CAPITAL LETTER O # →0→
1FBF0 ; 004F ; MA # ( 🯰 → O ) SEGMENTED DIGIT ZERO → LATIN CAPITAL LETTER O # →0→
FF2F ; 004F ; MA # ( O → O ) FULLWIDTH LATIN CAPITAL LETTER O → LATIN CAPITAL LETTER O # →О→
1D40E ; 004F ; MA # ( 𝐎 → O ) MATHEMATICAL BOLD CAPITAL O → LATIN CAPITAL LETTER O #
1D442 ; 004F ; MA # ( 𝑂 → O ) MATHEMATICAL ITALIC CAPITAL O → LATIN CAPITAL LETTER O #
Expand Down Expand Up @@ -8008,8 +8028,6 @@ FA92 ; 6717 ; MA # ( 朗 → 朗 ) CJK COMPATIBILITY IDEOGRAPH-FA92 → CJK UNIF
FA93 ; 671B ; MA # ( 望 → 望 ) CJK COMPATIBILITY IDEOGRAPH-FA93 → CJK UNIFIED IDEOGRAPH-671B #
2F8D9 ; 671B ; MA # ( 望 → 望 ) CJK COMPATIBILITY IDEOGRAPH-2F8D9 → CJK UNIFIED IDEOGRAPH-671B #

2F8DA ; 6721 ; MA # ( 朡 → 朡 ) CJK COMPATIBILITY IDEOGRAPH-2F8DA → CJK UNIFIED IDEOGRAPH-6721 #

5E50 ; 3B3A ; MA # ( 幐 → 㬺 ) CJK UNIFIED IDEOGRAPH-5E50 → CJK UNIFIED IDEOGRAPH-3B3A #

4420 ; 3B3B ; MA # ( 䐠 → 㬻 ) CJK UNIFIED IDEOGRAPH-4420 → CJK UNIFIED IDEOGRAPH-3B3B #
Expand Down Expand Up @@ -8815,6 +8833,8 @@ F953 ; 808B ; MA # ( 肋 → 肋 ) CJK COMPATIBILITY IDEOGRAPH-F953 → CJK UNIF

2F984 ; 440B ; MA # ( 䐋 → 䐋 ) CJK COMPATIBILITY IDEOGRAPH-2F984 → CJK UNIFIED IDEOGRAPH-440B #

2F8DA ; 6721 ; MA # ( 朡 → 朡 ) CJK COMPATIBILITY IDEOGRAPH-2F8DA → CJK UNIFIED IDEOGRAPH-6721 #

2F987 ; 267A7 ; MA # ( 𦞧 → 𦞧 ) CJK COMPATIBILITY IDEOGRAPH-2F987 → CJK UNIFIED IDEOGRAPH-267A7 #

2F988 ; 267B5 ; MA # ( 𦞵 → 𦞵 ) CJK COMPATIBILITY IDEOGRAPH-2F988 → CJK UNIFIED IDEOGRAPH-267B5 #
Expand Down Expand Up @@ -9614,5 +9634,5 @@ FACE ; 9F9C ; MA # ( 龜 → 龜 ) CJK COMPATIBILITY IDEOGRAPH-FACE → CJK UNIF

2FD5 ; 9FA0 ; MA #* ( ⿕ → 龠 ) KANGXI RADICAL FLUTE → CJK UNIFIED IDEOGRAPH-9FA0 #

# total: 6296
# total: 6311

2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@

setuptools.setup(
name="confusables",
version="1.1.1",
version="1.2.0",
author="Nathan Woodger",
author_email="[email protected]",
description="A python package providing functionality for matching words that can be confused for eachother, but contain different characters",
Expand Down

0 comments on commit 19acf7e

Please sign in to comment.