
There is an interesting use of this behaviour in the byte latent transformer model.

Encoding source text into tokens can be done in a number of ways: byte pair encoding, dictionaries, etc.

You can also just encode text into tokens (or directly into embeddings) with yet another model.

The problem with variable-length tokens is deciding how many characters go into any particular token, and, since that token must reproduce the text when you decode, where to store the count of characters each token contains.

The byte latent transformer model solves this by using the entropy of the next character. A small character-level model receives the history character by character and predicts the next one; when the entropy spikes from low to high, that counts as a token boundary. Decoding the same characters from the latent one at a time produces the same entropy sequence, which deterministically spikes at the same point, signalling the end of the token without the length ever being explicitly encoded.
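As a rough illustration of that boundary rule (my sketch, not the actual BLT code), here is a minimal Python version. small_model(prefix) is a hypothetical stand-in for the small character/byte-level LM, returning a probability distribution over the next symbol; the threshold value is made up.

    import math

    def next_symbol_entropy(probs):
        # Shannon entropy (in bits) of a next-symbol distribution.
        return -sum(p * math.log2(p) for p in probs if p > 0.0)

    def patch_boundaries(text, small_model, threshold=2.0):
        # Segment `text` into variable-length patches.
        # A boundary is placed wherever the model's next-symbol entropy
        # jumps from below to above `threshold`, i.e. where the small
        # model suddenly becomes uncertain about what comes next.
        boundaries = [0]
        prev_entropy = 0.0
        for i in range(1, len(text)):
            probs = small_model(text[:i])   # predict symbol i from symbols [0, i)
            entropy = next_symbol_entropy(probs)
            if entropy > threshold and prev_entropy <= threshold:
                boundaries.append(i)        # low -> high spike: start a new patch
            prev_entropy = entropy
        return boundaries

Because the same small model is run over the same prefix during decoding, it produces the same entropy values, so the boundaries fall out deterministically on both sides without storing any lengths.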

(disclaimer: My layman's view of it anyway, I may be completely wrong)


