Yup. I am a bit torn as well. I've toyed with a few ideas.

I have some code that defines how a file can be saved a zillion ways. It uses 4-byte enums. Note: not u32, because screw endianness. Every file is just a byte string. The first 4 bytes tell you the encryption method. Strip those off and use the corresponding algorithm to decrypt. The first 4 bytes of the decrypted file tell you the compression algorithm. Strip them off and decompress. The first 4 bytes of that tell you the serialization method (file type, I guess).

Makes it easy to just add new algorithms.
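The layered peel-a-tag-then-decode scheme can be sketched like this. The tag names and the registry contents here are hypothetical; only the 4-byte-prefix idea comes from the description above.

```python
# Sketch of the layered decode: peel a 4-byte tag, look up the algorithm,
# apply it, repeat for the next layer. Tags and registries are illustrative.
import json
import zlib

DECRYPTORS    = {b"NONE": lambda data: data}  # placeholder: no real crypto here
DECOMPRESSORS = {b"NONE": lambda data: data, b"ZLIB": zlib.decompress}
DESERIALIZERS = {b"JSON": lambda data: json.loads(data)}

def split_tag(blob: bytes):
    """Peel the 4-byte tag off the front of a byte string."""
    return blob[:4], blob[4:]

def load(blob: bytes):
    tag, rest = split_tag(blob)
    rest = DECRYPTORS[tag](rest)       # layer 1: encryption
    tag, rest = split_tag(rest)
    rest = DECOMPRESSORS[tag](rest)    # layer 2: compression
    tag, rest = split_tag(rest)
    return DESERIALIZERS[tag](rest)    # layer 3: serialization

# Build a file the same way in reverse: serialize, compress, encrypt,
# prefixing each layer's tag before handing it to the next.
payload = b"NONE" + (b"ZLIB" + zlib.compress(b"JSON" + b'{"hello": "world"}'))
print(load(payload))  # -> {'hello': 'world'}
```

Adding a new algorithm is just another entry in the relevant registry dict, which is exactly the extensibility being described.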


Discussion

i'm a big fan of compact semi-human-readable sentinels in binary data formats where structure is flexible. i'm using 3 byte prefixes for a key-value store to indicate which table a record belongs to.
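A minimal sketch of that table-prefix idea, with made-up table names (`usr`, `msg`) standing in for whatever the real store uses:

```python
# Hypothetical sketch: 3-byte table sentinels prefixed onto keys in a flat
# key-value store, so records group (and sort) by table.
USR = b"usr"   # user records
MSG = b"msg"   # message records

store = {}

def put(table: bytes, key: bytes, value: bytes):
    store[table + key] = value

def get(table: bytes, key: bytes) -> bytes:
    return store[table + key]

put(USR, b"alice", b"...profile...")
put(MSG, b"0001",  b"hello")

# All keys of one table share a prefix, so a range scan over the sorted
# keyspace yields exactly that table's records.
users = [k for k in sorted(store) if k.startswith(USR)]
print(users)  # -> [b'usralice']
```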

same reason i hate RFC numbers, BIP numbers, kind numbers, and nip numbers. they mean nothing without a decoder ring. they should be descriptive, or gtfo

in any case though, if you are going to need lexicographically sortable numbers, you want big-endian.
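A quick demonstration of why: byte-wise (lexicographic) comparison of big-endian integers matches numeric order, while little-endian scrambles it.

```python
# Big-endian vs little-endian under lexicographic (byte-wise) sorting.
import struct

nums = [1, 255, 256, 70000]
big    = sorted(struct.pack(">I", n) for n in nums)
little = sorted(struct.pack("<I", n) for n in nums)

print([struct.unpack(">I", b)[0] for b in big])     # -> [1, 255, 256, 70000]
print([struct.unpack("<I", b)[0] for b in little])  # -> [256, 1, 70000, 255]
```

Big-endian puts the most significant byte first, so comparing bytes left to right is the same as comparing magnitudes; little-endian compares the least significant byte first, which is why the second sort comes out in a nonsense order.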

it's just nutty to use a number when the thing could be a mnemonic and is not going to be so numerous that you can't fit it into 4 letters.

btw, this is a long-established convention: the Macintosh and Amiga system libraries both use 32-bit mnemonic keys to indicate filetypes, as key/value structure keys, and so on.
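That convention is the classic four-character code (FourCC / Mac OSType): a 32-bit integer whose bytes happen to be readable ASCII. A small sketch, using `TEXT` as an example tag:

```python
# A four-character code is just a 32-bit big-endian integer whose four
# bytes are printable ASCII -- the mnemonic is the number.
def fourcc(tag: str) -> int:
    assert len(tag) == 4
    return int.from_bytes(tag.encode("ascii"), "big")

def fourcc_name(value: int) -> str:
    return value.to_bytes(4, "big").decode("ascii")

code = fourcc("TEXT")
print(hex(code))          # -> 0x54455854
print(fourcc_name(code))  # -> TEXT
```

So the machine still gets a fixed-width integer to switch on, and the human gets a word.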

humans are not fucking rolodexes. language is words, not numbers. it is irrelevant that words are numbers: numbers are not words.

> it is irrelevant that words are numbers: numbers are not words.

100%. Unless we want to remove the human brain from these systems altogether, we'll have to deal with its word-centric processing.

It depends on where you are in the tech stack. UTF-8 or even ASCII is not inherently human readable; it is bytes that need to be interpreted as such and rendered as glyphs via a lookup table and a graphics engine.

If required, you can do the same for any enumerated type. The nice thing about text encodings is that they are widely accepted and have many implementations of renderers.

But if you are doing the tech right, most of your base protocol is not going to be human readable, not because legibility is undesirable, but because it is nigh impossible.

Think of the raw data on the line. You need source and destination IP addresses and port numbers. Then you need something like source and destination node IDs, and an ephemeral pubkey or nonce (depending on your primitives).

The rest is just gibberish because it is encrypted. None of that can easily be made parsable by humans.

Next you have the task of turning that encrypted blob into something useful. You need more keys and signatures, etc. Eventually you get some final decrypted, unpackaged data that you hand off to the client application. The underlying protocol doesn't care what the bytes it encapsulates/decapsulates are. It can't: if it knew anything about them, you'd be leaking metadata for men in the middle to hoover up.

Once you get to the client application, I agree with you. You want your data to be human friendly.

Generally I am a fan of being careful about how complex data in social media etc. gets interpreted, because people tell lies. For instance, do you translate an npub into the name its owner chooses, the name the viewer chooses, or a name the community chooses? Same with profile pictures.

The first sounds good, but you get impersonation or other lies. Numbers on the wire have to be interpreted the way the recipient wants, not the way the sender wants.

the decoder ring for UTF-8 is always available. a decoder ring for the meaning of kind numbers or BIPs is not.

you don't even need type values if your structure is rigid, like nostr events: you just define an encoding order, and if a field is fixed length it's redundant to give it a length prefix.

also mentioning network addresses: even those are handled in a human-readable form, not the native one. the native form of an IPv4 address is 4 bytes for the address and 2 bytes for the port.
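You can see the gap between the two forms directly:

```python
# Native wire form of an IPv4 endpoint: 4 address bytes plus a 2-byte
# big-endian (network order) port, vs the dotted-quad string humans read.
import socket
import struct

addr_bytes = socket.inet_aton("192.168.0.1")   # -> b'\xc0\xa8\x00\x01'
port_bytes = struct.pack(">H", 8080)           # -> b'\x1f\x90'

print(addr_bytes + port_bytes)                 # 6 bytes total on the wire
print(socket.inet_ntoa(addr_bytes))            # -> 192.168.0.1
```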

anyhow, yeah, of course you use whatever fits the task best, but one of the biggest reasons i favor mnemonic sentinels in binary data is that they are stable values. if you define a series of numbers and a whole bunch of them become deprecated, there are holes in the number space, and you can't reuse them without a breaking change. a 2 or 3 character sentinel (or even 4) is never going to change, and has enough space that you can just add more, with, say, an extra character signifying a version or whatever.