# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

We need to turn characters into numbers. We can do that with Unicode like this.

# Unicode

On Ubuntu Linux press [CTRL][ALT][T] to open a terminal, then type:

**schmuck@Schmoe:~$** `python3`

to get **>>>**

`ord('h')`

'h' has the Unicode *code point* 104. `ord` takes only a single character, so to get the code points of many characters use a list comprehension:

`[ord(x) for x in "إِنَّ اللَّهَ اصْطَفَىٰ آدَمَ وَنُوحًا وَآلَ إِبْرَاهِيمَ وَآلَ عِمْرَانَ"]`

[1573, 1616, 1606, 1617, 1614, 32, 1575, 1604, 1604, 1617, 1614, 1607, 1614, 32, 1575, 1589, 1618, 1591, 1614, 1601, 1614, 1609, 1648, 32, 1570, 1583, 1614, 1605, 1614, 32, 1608, 1614, 1606, 1615, 1608, 1581, 1611, 1575, 32, 1608, 1614, 1570, 1604, 1614, 32, 1573, 1616, 1576, 1618, 1585, 1614, 1575, 1607, 1616, 1610, 1605, 1614, 32, 1608, 1614, 1570, 1604, 1614, 32, 1593, 1616, 1605, 1618, 1585, 1614, 1575, 1606, 1614]
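`chr` is the inverse of `ord`, so a list of code points can be turned back into the text it came from. A minimal sketch, where the sample word is only an illustration and any string would do:

```python
# Round trip: characters -> code points -> characters.
sample = "آدَمَ"  # illustrative sample, not special in any way

code_points = [ord(ch) for ch in sample]            # one number per character
restored = "".join(chr(cp) for cp in code_points)   # chr undoes ord

print(code_points)
print(restored == sample)
```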

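But Unicode has well over a hundred thousand characters and the standard keeps growing, so code points alone give us a large and unstable set of numbers. Instead we can use *UTF-8*, an encoding of Unicode that turns our characters into *binary data* (a *byte stream*) like this: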

`"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8")`

b'\xd8\xa8\xd9\x90\xd8\xb3\xd9\x92\xd9\x85\xd9\x90 \xd8\xa7\xd9\x84\xd9\x84\xd9\x91\xd9\x8e\xd9\x87\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x92\xd9\x85\xd9\x8e\xd9\xb0\xd9\x86\xd9\x90 \xd8\xa7\xd9\x84\xd8\xb1\xd9\x91\xd9\x8e\xd8\xad\xd9\x90\xd9\x8a\xd9\x85\xd9\x90'

That is not very pretty, so we can turn it into a list of numbers to work with like this:

`list("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ".encode("utf-8"))`

[216, 168, 217, 144, 216, 179, 217, 146, 217, 133, 217, 144, 32, 216, 167, 217, 132, 217, 132, 217, 145, 217, 142, 217, 135, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 146, 217, 133, 217, 142, 217, 176, 217, 134, 217, 144, 32, 216, 167, 217, 132, 216, 177, 217, 145, 217, 142, 216, 173, 217, 144, 217, 138, 217, 133, 217, 144]
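The byte list is longer than the code point list because UTF-8 uses two bytes for every letter and diacritic in the basic Arabic block (U+0600 to U+06FF), while a space takes one byte. A small sketch of that, assuming a short sample word chosen only for illustration:

```python
# Compare characters with the UTF-8 bytes they encode to.
text = "بِسْمِ"  # illustrative sample
data = text.encode("utf-8")

print(len(text), "characters ->", len(data), "bytes")

# Each character in this sample becomes exactly two bytes.
for ch in text:
    print(hex(ord(ch)), list(ch.encode("utf-8")))

# Decoding the bytes gives the original text back.
print(data.decode("utf-8") == text)
```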

## Getting the data

Go to https://tanzil.net/download/ and choose 'Uthmani' under *Quran text type:*, 'Text' for *Output file format:* and tick all boxes except 'Include sequential tanweens'. Click **Download** to get the file, then open it in python3:

`text = open("quran-uthmani.txt", 'r').read()`

`print(text)`
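When `open()` is called without an `encoding` argument it falls back to a platform-dependent default, which is not always UTF-8. Passing `encoding="utf-8"` is the safer habit. The sketch below writes a tiny stand-in file first so it runs anywhere; the file name `sample.txt` and its contents are assumptions for illustration, and with the real data you would open `quran-uthmani.txt` instead:

```python
import os
import tempfile

# Hypothetical stand-in for quran-uthmani.txt.
sample = "بِسْمِ اللَّهِ\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# The with-block closes the file for us; the explicit encoding
# avoids relying on the platform default.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text == sample)
```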

We can encode in UTF-8 and get some ugly binary:

`tokens = text.encode("utf-8")`

`print(tokens)`

A neater way is a list of byte values, each in the range 0-255:

`tokens = list(map(int, tokens))`
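Iterating over a `bytes` object in Python 3 already yields integers, so `list(tokens)` should give the same result as `list(map(int, tokens))`. A quick check on a short illustrative sample:

```python
# bytes objects yield ints when iterated, so map(int, ...) is optional.
raw = "نُوح".encode("utf-8")  # illustrative sample

with_map = list(map(int, raw))
plain = list(raw)

print(with_map == plain)
print(all(0 <= b <= 255 for b in plain))  # every value fits in one byte
```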

`quit()` to get out of >>> and back to $

## Google Colab

If you have a Google account, or a throwaway SIM card and a dumb phone to get a new Google account by SMS verification, you can start using Google's Colab for **free**:

https://colab.research.google.com/

Yes, for **free** so when at the payment screen, click away from it to get the free lab.

## Adding your file to Google Colab

Click the file icon and then the dog-eared page with an up arrow and add your *quran-uthmani.txt* file from 'Getting the data' (above).


The file *will be deleted* when the session ends, so keep a copy saved somewhere and add it again every time you start Colab.

## Our actor example applied to Python

So, we talked earlier about Feynman and James Clear's 'Atomic Habits' and how actors learn their script.

We are now going to use that to learn our own 'scripting', that is learn programming:

*Type **everything out** and **never ever copy** and paste code. And never say never.*

We said we would not type or write by going back and forth, but in this case we will, as long as we type everything out. Typing puts the code into our memory, so we know where to find it when we need it in the future. For speed we will **type first**, because searching for something we saw months ago takes longer than trusting muscle memory and just typing it.

We say 'never say never' because there is no point in typing out long paragraphs of 'boilerplate' code, meaning code that is used as-is everywhere. In those **rare** cases, copy and paste.

## Getting the most common pairs

Type in Google Colab `text = open("quran-uthmani.txt", 'r').read()` and then press [SHIFT][ENTER] to run the 'code cell'.

`open()` `"quran-uthmani.txt"` as read-only `'r'` and save as `text`.

`tokens = text.encode("utf-8")`

take `text` and `encode()` it with `"utf-8"` and save as `tokens`

`tokens = list(map(int, tokens))`

```
def get_stats(ids):
    # Count how often each pair of neighbouring tokens occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print(stats)
```

{(216, 168): 11603, (168, 217): 11593, (217, 144): 46642, (144, 216): 7778, (216, 179): 6122, (179, 217): 6122, (217, 146): 37372, (146, 217): 14675, (217, 133): 27071, (133, 217): 25740, (144, 32): 7712, (32, 217): 45082, (217, 177): 13819, (177, 217): 25239, (217, 132): 38550, (132, 217): 36124, (217, 145): 23016, (145, 217): 23016, (217, 142): 123396, (142, 217): 53930, (217, 135): 14962, (135, 217): 14961, (132, 216): 2316, (216, 177): 12627, (142, 216): 50698, (216, 173): 4364, (173, 217): 4364, (217, 128): 6848, (128, 217): 6808, (217, 176): 9838, (176, 217): 10000, (217, 134): 27380, (134, 217): 22530, (144, 217): 29600, (217, 138): 18334, (138, 217): 16706, (144, 10): 595, (10, 217): 4481, (146, 216): 13852, (216, 175): 5991, (175, 217): 5945, (217, 143): 37320, (143, 32): 6675, (32, 216): 26762, (216, 185): 9405, (185, 217): 9403, (142, 10): 2843, (217, 131): 10497, (131, 217): 10497, (217, 136): 24970, (136, 217): 20377, (10, 216): 1555, (216, 165): 5088, (165, 217): 5260, (216, 167): 25184, (167, 217): 6829, (142, 32): 15924, (143, 216): 5806, (216, 170): 10520, (170, 217): 10504, (143, 10): 331, (167, 32): 10720, (216, 181): 2074, (181, 217): 2071, (176, 216): 3255, (216, 183): 1273, (183, 217): 1266, (217, 130): 7034, (130, 217): 7034, (216, 176): 4932, (216, 163): 8900, (163, 217): 8901, (146, 32): 8751, (216, 186): 1221, (186, 217): 1221, (216, 182): 1686, (182, 217): 1686, (143, 217): 23252, (136, 216): 4289, (217, 147): 5376, (147, 217): 90, (147, 10): 76, (32, 219): 4379, (219, 155): 12, (155, 32): 12, (217, 129): 8747, (129, 217): 8746, (217, 139): 3741, (139, 217): 93, (217, 137): 6603, (137, 32): 3035, (216, 164): 706, (164, 217): 706, (216, 169): 2344, (169, 217): 2344, (216, 178): 1599, (178, 217): 1599, (147, 32): 2459, (134, 216): 1499, (134, 32): 3081, (217, 148): 773, (148, 217): 773, (167, 216): 2953, (216, 174): 2497, (174, 217): 2497, (136, 219): 255, (219, 159): 3988, (159, 217): 268, (147, 216): 2751, (216, 166): 921, (166, 217): 
1085, (137, 217): 3531, (176, 32): 1337, (219, 150): 1682, (150, 32): 1682, (167, 219): 3789, (159, 32): 3704, (216, 161): 2782, (161, 217): 2782, (217, 140): 2519, (140, 32): 1777, (216, 180): 2124, (180, 217): 2124, (216, 184): 853, (184, 217): 853, (140, 10): 605, (133, 32): 1328, (139, 216): 2976, (140, 219): 134, (219, 162): 510, (162, 32): 338, (219, 151): 603, (151, 32): 603, (177, 216): 1197, (170, 32): 17, (216, 172): 3317, (172, 217): 3317, (216, 171): 1414, (171, 217): 1414, (143, 219): 1256, (219, 165): 1257, (165, 32): 1042, (217, 141): 2633, (141, 32): 2080, (219, 154): 1972, (154, 32): 1972, (138, 216): 1618, (139, 32): 556, (144, 219): 957, (219, 166): 957, (166, 32): 791, (219, 153): 68, (153, 32): 68, (10, 219): 199, (219, 158): 199, (158, 32): 199, (219, 152): 22, (152, 32): 22, (134, 219): 270, (162, 216): 158, (132, 32): 110, (141, 10): 454, (141, 219): 99, (219, 173): 99, (173, 32): 84, (168, 32): 9, (136, 32): 49, (128, 219): 40, (219, 167): 38, (139, 219): 106, (175, 216): 38, (181, 219): 3, (219, 156): 7, (156, 217): 2, (165, 216): 19, (175, 32): 8, (219, 160): 66, (160, 32): 62, (159, 216): 14, (177, 32): 9, (138, 219): 10, (167, 10): 931, (159, 10): 2, (140, 216): 3, (162, 10): 14, (183, 216): 7, (137, 219): 1, (171, 32): 1, (219, 169): 15, (169, 10): 15, (177, 219): 1, (219, 170): 1, (142, 219): 1, (219, 171): 1, (129, 32): 1, (156, 10): 2, (137, 10): 36, (185, 32): 2, (135, 10): 1, (176, 10): 178, (146, 10): 94, (219, 168): 1, (168, 216): 1, (160, 10): 4, (173, 10): 15, (156, 32): 3, (219, 172): 1, (172, 216): 1, (133, 10): 3, (219, 163): 1, (139, 10): 10, (165, 10): 24, (166, 10): 2, (168, 10): 1, (10, 10): 2, (10, 35): 28, (35, 32): 18, (32, 80): 6, (80, 76): 1, (76, 69): 1, (69, 65): 1, (65, 83): 1, (83, 69): 2, (69, 32): 3, (32, 68): 1, (68, 79): 1, (79, 32): 1, (32, 78): 2, (78, 79): 2, (79, 84): 2, (84, 32): 4, (32, 82): 1, (82, 69): 1, (69, 77): 1, (77, 79): 1, (79, 86): 1, (86, 69): 1, (32, 79): 2, (79, 82): 1, (82, 32): 1, (32, 
67): 6, (67, 72): 2, (72, 65): 2, (65, 78): 2, (78, 71): 3, (71, 69): 1, (32, 84): 9, (84, 72): 1, (72, 73): 1, (73, 83): 2, (83, 32): 3, (67, 79): 1, (79, 80): 1, (80, 89): 1, (89, 82): 1, (82, 73): 1, (73, 71): 1, (71, 72): 1, (72, 84): 1, (32, 66): 1, (66, 76): 1, (76, 79): 2, (79, 67): 1, (67, 75): 1, (75, 10): 1, (35, 61): 2, (61, 61): 134, (61, 10): 2, (35, 10): 8, (32, 32): 29, (84, 97): 4, (97, 110): 19, (110, 122): 6, (122, 105): 6, (105, 108): 7, (108, 32): 9, (32, 81): 3, (81, 117): 3, (117, 114): 4, (114, 97): 5, (110, 32): 11, (84, 101): 1, (101, 120): 6, (120, 116): 6, (116, 32): 9, (32, 40): 3, (40, 85): 1, (85, 116): 1, (116, 104): 6, (104, 109): 1, (109, 97): 2, (110, 105): 3, (105, 44): 1, (44, 32): 6, (32, 86): 1, (86, 101): 1, (101, 114): 7, (114, 115): 2, (115, 105): 3, (105, 111): 5, (111, 110): 9, (32, 49): 1, (49, 46): 1, (46, 49): 1, (49, 41): 1, (41, 10): 1, (67, 111): 2, (111, 112): 7, (112, 121): 4, (121, 114): 2, (114, 105): 7, (105, 103): 3, (103, 104): 3, (104, 116): 3, (40, 67): 1, (67, 41): 1, (41, 32): 2, (32, 50): 1, (50, 48): 2, (48, 48): 1, (48, 55): 1, (55, 45): 1, (45, 50): 1, (48, 50): 1, (50, 52): 1, (52, 32): 1, (80, 114): 3, (114, 111): 9, (111, 106): 3, (106, 101): 3, (101, 99): 5, (99, 116): 3, (116, 10): 1, (32, 76): 1, (76, 105): 1, (105, 99): 4, (99, 101): 5, (101, 110): 2, (110, 115): 2, (115, 101): 4, (101, 58): 1, (58, 32): 2, (67, 114): 1, (114, 101): 4, (101, 97): 3, (97, 116): 11, (116, 105): 9, (105, 118): 2, (118, 101): 5, (101, 32): 13, (111, 109): 2, (109, 109): 1, (109, 111): 2, (115, 32): 17, (32, 65): 2, (65, 116): 1, (116, 116): 2, (116, 114): 3, (105, 98): 2, (98, 117): 3, (117, 116): 3, (32, 51): 1, (51, 46): 1, (46, 48): 1, (48, 10): 1, (84, 104): 3, (104, 105): 6, (105, 115): 12, (32, 99): 12, (99, 111): 7, (121, 32): 9, (32, 111): 8, (111, 102): 6, (102, 32): 6, (32, 116): 16, (104, 101): 3, (116, 101): 12, (32, 105): 10, (99, 97): 4, (97, 114): 2, (101, 102): 1, (102, 117): 1, (117, 108): 1, (108, 
108): 5, (108, 121): 5, (32, 112): 3, (112, 114): 5, (111, 100): 2, (100, 117): 2, (117, 99): 2, (101, 100): 10, (100, 44): 2, (32, 104): 2, (104, 108): 1, (32, 10): 7, (32, 118): 3, (105, 102): 1, (102, 105): 2, (105, 101): 3, (100, 32): 12, (32, 97): 13, (110, 100): 5, (110, 116): 4, (105, 110): 9, (110, 117): 1, (117, 111): 1, (111, 117): 3, (117, 115): 3, (115, 108): 1, (32, 109): 2, (105, 116): 3, (116, 111): 5, (111, 114): 4, (32, 98): 5, (98, 121): 1, (97, 32): 2, (32, 103): 2, (103, 114): 2, (117, 112): 3, (112, 32): 1, (32, 115): 5, (115, 112): 1, (112, 101): 1, (99, 105): 1, (105, 97): 3, (97, 108): 6, (108, 105): 3, (115, 116): 3, (116, 115): 2, (116, 46): 2, (46, 10): 4, (84, 69): 1, (69, 82): 1, (82, 77): 1, (77, 83): 1, (79, 70): 1, (70, 32): 1, (32, 85): 1, (85, 83): 1, (69, 58): 1, (58, 10): 1, (32, 45): 3, (45, 32): 3, (80, 101): 1, (114, 109): 1, (109, 105): 1, (115, 115): 1, (111, 32): 4, (32, 100): 2, (100, 105): 2, (114, 98): 2, (98, 97): 2, (105, 109): 2, (109, 32): 3, (112, 105): 2, (101, 115): 6, (116, 44): 2, (71, 73): 1, (73, 78): 1, (71, 32): 1, (32, 73): 2, (73, 84): 1, (65, 76): 1, (76, 76): 1, (79, 87): 1, (87, 69): 1, (69, 68): 1, (68, 46): 1, (98, 101): 3, (32, 117): 3, (110, 121): 1, (32, 119): 1, (119, 101): 1, (101, 98): 1, (98, 115): 2, (114, 32): 2, (97, 112): 2, (112, 112): 2, (112, 108): 1, (110, 44): 1, (111, 118): 1, (118, 105): 1, (105, 100): 1, (100, 101): 4, (104, 97): 4, (115, 111): 1, (114, 99): 1, (40, 84): 1, (116, 41): 1, (99, 108): 2, (108, 101): 4, (114, 108): 1, (32, 108): 1, (110, 107): 1, (107, 32): 3, (97, 100): 1, (116, 97): 4, (108, 46): 2, (46, 110): 2, (110, 101): 2, (101, 116): 2, (32, 101): 1, (110, 97): 1, (97, 98): 1, (98, 108): 1, (32, 107): 1, (107, 101): 1, (101, 101): 1, (101, 112): 2, (112, 10): 1, (97, 99): 1, (99, 107): 2, (99, 104): 2, (110, 103): 2, (103, 101): 1, (115, 46): 1, (32, 110): 1, (110, 111): 1, (111, 116): 1, (115, 104): 2, (110, 99): 1, (108, 117): 1, (117, 100): 1, (32, 114): 1, 
(101, 108): 1, (32, 102): 2, (102, 114): 1, (97, 105): 1, (103, 32): 1, (115, 117): 1, (117, 98): 1, (112, 111): 1, (114, 116): 1, (80, 108): 1, (97, 115): 1, (112, 100): 2, (100, 97): 2, (116, 58): 1, (116, 112): 1, (112, 58): 1, (58, 47): 1, (47, 47): 1, (47, 116): 1, (116, 47): 1, (47, 117): 1, (115, 47): 1, (47, 10): 1}
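The same counts can also be produced with `collections.Counter` from the standard library. This is an alternative sketch, not the version used in the rest of this note, and the name `get_stats_counter` is made up here for the comparison:

```python
from collections import Counter

def get_stats(ids):
    # Dictionary version, as used in this note.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def get_stats_counter(ids):
    # Hypothetical alternative: Counter does the counting for us.
    return Counter(zip(ids, ids[1:]))

sample = [1, 2, 3, 1, 2]
print(get_stats(sample))  # {(1, 2): 2, (2, 3): 1, (3, 1): 1}
print(dict(get_stats_counter(sample)) == get_stats(sample))  # True
```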

## The most common pair is ...

`print(sorted(((v,k) for k,v in stats.items()), reverse=True))`

`sorted` by value `v` puts the most common pairs first.

[(123396, (217, 142)), (53930, (142, 217)), (50698, (142, 216)), (46642, (217, 144)), (45082, (32, 217)), (38550, (217, 132)), (37372, (217, 146)), (37320, (217, 143)), (36124, (132, 217)), (29600, (144, 217)), (27380, (217, 134)), (27071, (217, 133)), (26762, (32, 216)), (25740, (133, 217)), (25239, (177, 217)), (25184, (216, 167)), (24970, (217, 136)), (23252, (143, 217)), (23016, (217, 145)), (23016, (145, 217)), (22530, (134, 217)), (20377, (136, 217)), (18334, (217, 138)), (16706, (138, 217)), (15924, (142, 32)), (14962, (217, 135)), (14961, (135, 217)), (14675, (146, 217)), (13852, (146, 216)), (13819, (217, 177)), (12627, (216, 177)), (11603, (216, 168)), (11593, (168, 217)), (10720, (167, 32)), (10520, (216, 170)), (10504, (170, 217)), (10497, (217, 131)), (10497, (131, 217)), (10000, (176, 217)), (9838, (217, 176)), (9405, (216, 185)), (9403, (185, 217)), (8901, (163, 217)), (8900, (216, 163)), (8751, (146, 32)), (8747, (217, 129)), (8746, (129, 217)), (7778, (144, 216)), (7712, (144, 32)), (7034, (217, 130)), (7034, (130, 217)), (6848, (217, 128)), (6829, (167, 217)), (6808, (128, 217)), (6675, (143, 32)), (6603, (217, 137)), (6122, (216, 179)), (6122, (179, 217)), (5991, (216, 175)), (5945, (175, 217)), (5806, (143, 216)), (5376, (217, 147)), (5260, (165, 217)), (5088, (216, 165)), (4932, (216, 176)), (4481, (10, 217)), (4379, (32, 219)), (4364, (216, 173)), (4364, (173, 217)), (4289, (136, 216)), (3988, (219, 159)), (3789, (167, 219)), (3741, (217, 139)), (3704, (159, 32)), (3531, (137, 217)), (3317, (216, 172)), (3317, (172, 217)), (3255, (176, 216)), (3081, (134, 32)), (3035, (137, 32)), (2976, (139, 216)), (2953, (167, 216)), (2843, (142, 10)), (2782, (216, 161)), (2782, (161, 217)), (2751, (147, 216)), (2633, (217, 141)), (2519, (217, 140)), (2497, (216, 174)), (2497, (174, 217)), (2459, (147, 32)), (2344, (216, 169)), (2344, (169, 217)), (2316, (132, 216)), (2124, (216, 180)), (2124, (180, 217)), (2080, (141, 32)), (2074, (216, 181)), (2071, (181, 
217)), (1972, (219, 154)), (1972, (154, 32)), (1777, (140, 32)), (1686, (216, 182)), (1686, (182, 217)), (1682, (219, 150)), (1682, (150, 32)), (1618, (138, 216)), (1599, (216, 178)), (1599, (178, 217)), (1555, (10, 216)), (1499, (134, 216)), (1414, (216, 171)), (1414, (171, 217)), (1337, (176, 32)), (1328, (133, 32)), (1273, (216, 183)), (1266, (183, 217)), (1257, (219, 165)), (1256, (143, 219)), (1221, (216, 186)), (1221, (186, 217)), (1197, (177, 216)), (1085, (166, 217)), (1042, (165, 32)), (957, (219, 166)), (957, (144, 219)), (931, (167, 10)), (921, (216, 166)), (853, (216, 184)), (853, (184, 217)), (791, (166, 32)), (773, (217, 148)), (773, (148, 217)), (706, (216, 164)), (706, (164, 217)), (605, (140, 10)), (603, (219, 151)), (603, (151, 32)), (595, (144, 10)), (556, (139, 32)), (510, (219, 162)), (454, (141, 10)), (338, (162, 32)), (331, (143, 10)), (270, (134, 219)), (268, (159, 217)), (255, (136, 219)), (199, (219, 158)), (199, (158, 32)), (199, (10, 219)), (178, (176, 10)), (158, (162, 216)), (134, (140, 219)), (134, (61, 61)), (110, (132, 32)), (106, (139, 219)), (99, (219, 173)), (99, (141, 219)), (94, (146, 10)), (93, (139, 217)), (90, (147, 217)), (84, (173, 32)), (76, (147, 10)), (68, (219, 153)), (68, (153, 32)), (66, (219, 160)), (62, (160, 32)), (49, (136, 32)), (40, (128, 219)), (38, (219, 167)), (38, (175, 216)), (36, (137, 10)), (29, (32, 32)), (28, (10, 35)), (24, (165, 10)), (22, (219, 152)), (22, (152, 32)), (19, (165, 216)), (19, (97, 110)), (18, (35, 32)), (17, (170, 32)), (17, (115, 32)), (16, (32, 116)), (15, (219, 169)), (15, (173, 10)), (15, (169, 10)), (14, (162, 10)), (14, (159, 216)), (13, (101, 32)), (13, (32, 97)), (12, (219, 155)), (12, (155, 32)), (12, (116, 101)), (12, (105, 115)), (12, (100, 32)), (12, (32, 99)), (11, (110, 32)), (11, (97, 116)), (10, (139, 10)), (10, (138, 219)), (10, (101, 100)), (10, (32, 105)), (9, (177, 32)), (9, (168, 32)), (9, (121, 32)), (9, (116, 105)), (9, (116, 32)), (9, (114, 111)), (9, (111, 
110)), (9, (108, 32)), (9, (105, 110)), (9, (32, 84)), (8, (175, 32)), (8, (35, 10)), (8, (32, 111)), (7, (219, 156)), (7, (183, 216)), (7, (114, 105)), (7, (111, 112)), (7, (105, 108)), (7, (101, 114)), (7, (99, 111)), (7, (32, 10)), (6, (122, 105)), (6, (120, 116)), (6, (116, 104)), (6, (111, 102)), (6, (110, 122)), (6, (104, 105)), (6, (102, 32)), (6, (101, 120)), (6, (101, 115)), (6, (97, 108)), (6, (44, 32)), (6, (32, 80)), (6, (32, 67)), (5, (118, 101)), (5, (116, 111)), (5, (114, 97)), (5, (112, 114)), (5, (110, 100)), (5, (108, 121)), (5, (108, 108)), (5, (105, 111)), (5, (101, 99)), (5, (99, 101)), (5, (32, 115)), (5, (32, 98)), (4, (160, 10)), (4, (117, 114)), (4, (116, 97)), (4, (115, 101)), (4, (114, 101)), (4, (112, 121)), (4, (111, 114)), (4, (111, 32)), (4, (110, 116)), (4, (108, 101)), (4, (105, 99)), (4, (104, 97)), (4, (100, 101)), (4, (99, 97)), (4, (84, 97)), (4, (84, 32)), (4, (46, 10)), (3, (181, 219)), (3, (156, 32)), (3, (140, 216)), (3, (133, 10)), (3, (117, 116)), (3, (117, 115)), (3, (117, 112)), (3, (116, 114)), (3, (115, 116)), (3, (115, 105)), (3, (111, 117)), (3, (111, 106)), (3, (110, 105)), (3, (109, 32)), (3, (108, 105)), (3, (107, 32)), (3, (106, 101)), (3, (105, 116)), (3, (105, 103)), (3, (105, 101)), (3, (105, 97)), (3, (104, 116)), (3, (104, 101)), (3, (103, 104)), (3, (101, 97)), (3, (99, 116)), (3, (98, 117)), (3, (98, 101)), (3, (84, 104)), (3, (83, 32)), (3, (81, 117)), (3, (80, 114)), (3, (78, 71)), (3, (69, 32)), (3, (45, 32)), (3, (32, 118)), (3, (32, 117)), (3, (32, 112)), (3, (32, 81)), (3, (32, 45)), (3, (32, 40)), (2, (185, 32)), (2, (166, 10)), (2, (159, 10)), (2, (156, 217)), (2, (156, 10)), (2, (121, 114)), (2, (117, 99)), (2, (116, 116)), (2, (116, 115)), (2, (116, 46)), (2, (116, 44)), (2, (115, 104)), (2, (114, 115)), (2, (114, 98)), (2, (114, 32)), (2, (112, 112)), (2, (112, 105)), (2, (112, 100)), (2, (111, 109)), (2, (111, 100)), (2, (110, 115)), (2, (110, 103)), (2, (110, 101)), (2, (109, 111)), (2, (109, 
97)), (2, (108, 46)), (2, (105, 118)), (2, (105, 109)), (2, (105, 98)), (2, (103, 114)), (2, (102, 105)), (2, (101, 116)), (2, (101, 112)), (2, (101, 110)), (2, (100, 117)), (2, (100, 105)), (2, (100, 97)), (2, (100, 44)), (2, (99, 108)), (2, (99, 107)), (2, (99, 104)), (2, (98, 115)), (2, (98, 97)), (2, (97, 114)), (2, (97, 112)), (2, (97, 32)), (2, (83, 69)), (2, (79, 84)), (2, (78, 79)), (2, (76, 79)), (2, (73, 83)), (2, (72, 65)), (2, (67, 111)), (2, (67, 72)), (2, (65, 78)), (2, (61, 10)), (2, (58, 32)), (2, (50, 48)), (2, (46, 110)), (2, (41, 32)), (2, (35, 61)), (2, (32, 109)), (2, (32, 104)), (2, (32, 103)), (2, (32, 102)), (2, (32, 100)), (2, (32, 79)), (2, (32, 78)), (2, (32, 73)), (2, (32, 65)), (2, (10, 10)), (1, (219, 172)), (1, (219, 171)), (1, (219, 170)), (1, (219, 168)), (1, (219, 163)), (1, (177, 219)), (1, (172, 216)), (1, (171, 32)), (1, (168, 216)), (1, (168, 10)), (1, (142, 219)), (1, (137, 219)), (1, (135, 10)), (1, (129, 32)), (1, (119, 101)), (1, (118, 105)), (1, (117, 111)), (1, (117, 108)), (1, (117, 100)), (1, (117, 98)), (1, (116, 112)), (1, (116, 58)), (1, (116, 47)), (1, (116, 41)), (1, (116, 10)), (1, (115, 117)), (1, (115, 115)), (1, (115, 112)), (1, (115, 111)), (1, (115, 108)), (1, (115, 47)), (1, (115, 46)), (1, (114, 116)), (1, (114, 109)), (1, (114, 108)), (1, (114, 99)), (1, (112, 111)), (1, (112, 108)), (1, (112, 101)), (1, (112, 58)), (1, (112, 32)), (1, (112, 10)), (1, (111, 118)), (1, (111, 116)), (1, (110, 121)), (1, (110, 117)), (1, (110, 111)), (1, (110, 107)), (1, (110, 99)), (1, (110, 97)), (1, (110, 44)), (1, (109, 109)), (1, (109, 105)), (1, (108, 117)), (1, (107, 101)), (1, (105, 102)), (1, (105, 100)), (1, (105, 44)), (1, (104, 109)), (1, (104, 108)), (1, (103, 101)), (1, (103, 32)), (1, (102, 117)), (1, (102, 114)), (1, (101, 108)), (1, (101, 102)), (1, (101, 101)), (1, (101, 98)), (1, (101, 58)), (1, (99, 105)), (1, (98, 121)), (1, (98, 108)), (1, (97, 115)), (1, (97, 105)), (1, (97, 100)), (1, (97, 99)), (1, 
(97, 98)), (1, (89, 82)), (1, (87, 69)), (1, (86, 101)), (1, (86, 69)), (1, (85, 116)), (1, (85, 83)), (1, (84, 101)), (1, (84, 72)), (1, (84, 69)), (1, (82, 77)), (1, (82, 73)), (1, (82, 69)), (1, (82, 32)), (1, (80, 108)), (1, (80, 101)), (1, (80, 89)), (1, (80, 76)), (1, (79, 87)), (1, (79, 86)), (1, (79, 82)), (1, (79, 80)), (1, (79, 70)), (1, (79, 67)), (1, (79, 32)), (1, (77, 83)), (1, (77, 79)), (1, (76, 105)), (1, (76, 76)), (1, (76, 69)), (1, (75, 10)), (1, (73, 84)), (1, (73, 78)), (1, (73, 71)), (1, (72, 84)), (1, (72, 73)), (1, (71, 73)), (1, (71, 72)), (1, (71, 69)), (1, (71, 32)), (1, (70, 32)), (1, (69, 82)), (1, (69, 77)), (1, (69, 68)), (1, (69, 65)), (1, (69, 58)), (1, (68, 79)), (1, (68, 46)), (1, (67, 114)), (1, (67, 79)), (1, (67, 75)), (1, (67, 41)), (1, (66, 76)), (1, (65, 116)), (1, (65, 83)), (1, (65, 76)), (1, (58, 47)), (1, (58, 10)), (1, (55, 45)), (1, (52, 32)), (1, (51, 46)), (1, (50, 52)), (1, (49, 46)), (1, (49, 41)), (1, (48, 55)), (1, (48, 50)), (1, (48, 48)), (1, (48, 10)), (1, (47, 117)), (1, (47, 116)), (1, (47, 47)), (1, (47, 10)), (1, (46, 49)), (1, (46, 48)), (1, (45, 50)), (1, (41, 10)), (1, (40, 85)), (1, (40, 84)), (1, (40, 67)), (1, (32, 119)), (1, (32, 114)), (1, (32, 110)), (1, (32, 108)), (1, (32, 107)), (1, (32, 101)), (1, (32, 86)), (1, (32, 85)), (1, (32, 82)), (1, (32, 76)), (1, (32, 68)), (1, (32, 66)), (1, (32, 51)), (1, (32, 50)), (1, (32, 49))]
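Printing every pair is a lot to read. A shorter view sorts with a `key` and slices off the top few. The sketch below uses four counts copied from the output above as a stand-in for the full `stats` dictionary, and the name `stats_sample` is only for illustration:

```python
# Four entries copied from the real output, in scrambled order.
stats_sample = {
    (217, 144): 46642,
    (142, 216): 50698,
    (217, 142): 123396,
    (142, 217): 53930,
}

# Sort by count (the dictionary value), largest first, and keep the top 3.
top = sorted(stats_sample.items(), key=lambda kv: kv[1], reverse=True)[:3]

for pair, count in top:
    print(pair, count)
```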

```
top_pair = max(stats, key=stats.get)
top_pair
```

Our most common pair is (217, 142), shown at the top of our list as (123396, (217, 142)): it occurs 123396 times.

`chr(217), chr(142)`

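('Ù', '\x8e')

This output is misleading: `chr` reads each number as a separate code point, but 217 and 142 are the two UTF-8 bytes of one character, U+064E (the fatha mark, code point 1614 in the `ord` list near the top).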
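A minimal sketch of decoding the pair as UTF-8 bytes rather than as two code points, using only the standard library:

```python
import unicodedata

pair = (217, 142)

# Treat the two numbers as raw bytes and decode them together.
ch = bytes(pair).decode("utf-8")

print(hex(ord(ch)), unicodedata.name(ch))  # 0x64e ARABIC FATHA
```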

## Swapping the pair for a single token

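```
def get_stats(ids):
    # Count how often each pair of neighbouring tokens occurs.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of pair in ids with the single token idx.
    newids = []
    i = 0
    while i < len(ids):
        # Match only if a next element exists and both halves agree.
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))
```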

This replaces the `pair` (6, 7) in the list `[5, 6, 6, 7, 9, 1]`, passed in as `ids`, with the single token `idx` 99:

[5, 6, 99, 9, 1]
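A few edge cases show how `merge` behaves: it scans left to right, so overlapping matches are taken greedily, and a list without the pair comes back unchanged. The inputs below are made up for illustration:

```python
def merge(ids, pair, idx):
    # Replace every occurrence of pair in ids with the single token idx.
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([1, 1, 1], (1, 1), 99))  # [99, 1]  first two merge, the third is left over
print(merge([6, 7], (6, 7), 99))     # [99]     the whole list is one pair
print(merge([5, 6], (6, 7), 99))     # [5, 6]   no match, nothing changes
print(merge([], (6, 7), 99))         # []       empty input is fine
```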

## We have tokens 0-255, so we replace the most common pair with a new token, 256:

```
tokens2 = merge(tokens, top_pair, 256)
#print(tokens2)
print("length:", len(tokens2))
```

length: 1237147
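Each merge shortens the list by one entry per replaced pair. When the two halves of the pair differ, as with (217, 142), occurrences cannot overlap, so the new length is the old length minus the pair count. By that reasoning the original `tokens` list would have had 1237147 + 123396 = 1360543 entries. That figure is derived here, not printed by the code above. A toy check with a made-up list:

```python
def merge(ids, pair, idx):
    # Same merge as above, repeated so this sketch runs on its own.
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = [217, 142, 32, 217, 142, 216, 167]  # illustrative byte values
pair = (217, 142)

count = sum(1 for p in zip(ids, ids[1:]) if p == pair)
merged = merge(ids, pair, 256)

print(len(ids), "-", count, "=", len(merged))          # 7 - 2 = 5
print("compression ratio:", len(ids) / len(merged))    # 1.4
```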

```
vocab_size = 276  # desired final vocabulary size
num_merges = vocab_size - 256  # the 256 byte tokens already exist
ids = list(tokens)  # copy so the original list is not changed

merges = {}  # (int, int) -> int
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    idx = 256 + i
    print(f'merging {pair} into a new token {idx}')
    ids = merge(ids, pair, idx)
    merges[pair] = idx
```
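With `merges` in hand, the usual next step in byte-pair encoding is to turn new text into tokens and tokens back into text. The sketch below is one way to complete the loop above, not a definitive implementation. The names `train`, `build_vocab`, `encode` and `decode` are assumptions introduced here, and it trains on a single short line with 5 merges instead of the whole file so that it runs quickly:

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train(ids, num_merges):
    # Same loop as above, wrapped in a function (hypothetical helper).
    ids = list(ids)
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:  # fewer than two tokens left, nothing to merge
            break
        pair = max(stats, key=stats.get)
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return merges

def build_vocab(merges):
    # Map every token id to the bytes it stands for.
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():  # dicts keep training order
        vocab[idx] = vocab[p0] + vocab[p1]
    return vocab

def decode(ids, vocab):
    # Token ids -> bytes -> text.
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")

def encode(text, merges):
    # Text -> bytes -> apply the learned merges, earliest merge first.
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids

sample = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
merges = train(sample.encode("utf-8"), num_merges=5)
vocab = build_vocab(merges)

ids = encode(sample, merges)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, vocab) == sample)
```

In Colab you would pass the full `tokens` list and `vocab_size - 256` to `train` instead of the one-line sample.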

## Source:

https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

# nostr:npub1nc2je43av297thyt33g7sd0hq7l7p353xzj5yca7l6un3kj5nrgqgqrun4 are you able to help with, fix or complete this?

nostr:nevent1qqszxxdkdujez636c9kv805w9n39qxxhmlr29wcxrvgr38ucvqn4k2czyrqdusz9qhp2r3z3dzlcft4pxqsqtlcyx22zelhn5sxtrkmwu4mluqcyqqq823c3ye6zt

#LLM #tokenization #grownostr #python #asknostr


Discussion

You probably want to train an AI model (text-to-speech, LLM, ...) and now want to tokenize Arabic letters. I did some tokenization, but it was in English and German, so I'm not sure if I can help here. The general idea is to get any non-numeric representation as a numeric representation. So, instead of having letters, smileys, or any kind of data representation, you want to find a number that represents the same meaning as the original representation.

For the Arabic language, you might be able to use ASCII, UTF, and so on as used in the blog. You might also be able to include an Arabic font. Another approach that I think would be a good alternative is to use an existing Arabic tokenizer. Have a look at Hugging Face. Coming up with your own tokenizer could be a good idea if you know what you're doing.

However, note that the quality of your LLM highly depends on your tokenizer. Thus, I would suggest going with an already existing Arabic tokenizer. Chances are that it's better than what a non-expert in this field can come up with.

I'm not sure if this has helped you, but for the moment, I cannot invest more time in other things. All the best, my friend.