On a collision course

This will be a quick post about why you should not make your own identifiers if you can and also if you need to have an identifier that is easy to communicate with/about for clients base it on something that is still very much unique.

Entropy

In order to quantize the measurement of how unique something is we need a unit to measure it in. This measurement is called entropy or the level of entropy and the unit is bits. Next we will compare two identifiers, one is UUIDv4 (Universally Unique Identifier v4) and the other a custom one with a specific format and their respective bits.

UUIDv4

As stated on the Wikipedia page, the number of entropy bits are 122. This is due to the fact a UUID is actually a representation of 128 bit integer. 4 bits are needed to denote the version and 2 bits are needed to mark the variant. So 128 – 6 bits is 122.

Custom format

The custom format will follow the convention of

<first id><hex value><second id>-<day of the month><hex value>-<hex value><hex value><hex value>

The first and second id are single digit values. The hex value comes from a UUIDv4 generated value and the day of the month is a double digit. An example would be: 1F2-01D-36F.

Calculation

To calculate the entropy there are different tactics. One is the Shannon entropy, the other is the Rényi entropy and there is a simple calculation for guessing entropy. The guessing entropy is calculated by doing log₂(A) * L where A is the size of the alphabet, how many possible characters can inhabit a spot, and L is the total length or how many spots we have of these. In our case it is that the first id, second id and day of the month are all static. That leaves 5 spots of hex values which can contain 16 possibilities. Formula will be: log₂(16) * 5 = 20.

So we have 20 bits of entropy vs 122 bits.

Collision

We can calculate the number that is needed in order to get a 50% probability that an identifier will be generated that already exist and therefore there will be a collision. Formula to do this is: 0.5 + √(0.25 + 2 * ln(2) * 2^entropy ). For UUIDv4 this gives 0.5 + √(0.25 + 2 * ln(2) * 2¹²² ) = 2.71 * 10¹⁸ . For our custom implementation this gives 0.5 + √(0.25 + 2 * ln(2) * 2²⁰ ) = 1.206.

This means it takes only 1.206 entries with our new format to arrive at a 50-50 chance of duplicating our unique identifier.

Solution

The solution would be to store the unique id and the displayed id together. Then the lower entropy would be okay, as the user combined with the lower entropy identifier would be unique but only if you stay within the 1206 entries. A person might get above that number though.

So we use the UUIDv4 to store the actual unique identifier. Then for the display one the customer can communicate with a good representation of a number that will be generated once and let us take all letters of the alphabet and numbers for a total of 36 which will be the size of A and make L=9. This means we get about 46 bits of entropy and that would make for 9.876.831 entries needing to be generated to get to 50-50 chance. Since for a single user it would be very difficult to get that many entries I think we would be secure and in worst case check if the id already exists and then generate another one.

#devlife #secops #devsecops