Time-sortable, 64bit UUID generation, 10 millions without collision

edwinsn · 2015-04-18 05:32:50

This is a result of my searching for a UUID algorithm that fits in a Int64, inspired by node-druuid (https://github.com/recurly/node-druuid)

The result is suitable for my scenario where data in branch offices will be synced to a single master database, and I don't want other human-error-prone methods such as allocating UID for each database.

One advantage over full random number is that the UUIDs are timestamp-based and thus sort-able.

Collision testing code (a console program, tests 10 millions times):

var
  i, total: Integer;
  ids: TDictionary<Int64, Int64>;
begin
  try
    total := 10000000; // ten millions
    ids := TDictionary<Int64, Int64>.Create(total);

    for i := 1 to total do
    begin
      myId := EyUtils.Generate64BitUUID();
      Writeln(humansInt(i) + ': ' + IntToStr(myId));
      ids.Add(myId, i);
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  ids.Free;
  ReadLn;
end.

The function, which is actually quite simple:

(*
  Generates a time-sortable, 64-bit UUID.
  Inspired by: https://github.com/recurly/node-druuid
  An uuid comprises:
    - A 42-bit timestamp = mill secs. between Now and 2000/01/01 00:00:00:000
    - 22 random bits.
*)
function Generate64BitUUID(aOffsetFromYear: Integer = 2000): Int64;
var
  startYear: TDateTime;
begin
  //first part is 42-bit = timestamp = milliseconds since 2000/01/01 00:00:00:000
  startYear := EncodeDateTime(aOffsetFromYear, 1, 1, 0, 0, 0, 0);
  Result := MilliSecondsBetween(Now, startYear) shl 22;

  //second part = 22 random bits
  Randomize;//must, to reduce the chance of random number collision
  Result := Result or Random($3FFFFF);
end;

Comments and suggestions are welcomed!

Last edited by edwinsn (2015-04-18 05:38:24)

ab · 2015-04-18 08:06:18

Random() has a lot of potential collisions.

IMHO using a monotonic increasing ID + a per-node ID would be safer.

- 42 bits: TimeStamp in ms
- 12 bits: increasing ID
- 10 bits: fixed node ID

edwinsn · 2015-04-18 10:30:11

You are right on the safe perspective, but since in my scenario the data column is quite low so it should be OK.

alpinistbg · 2015-04-18 16:01:02

One advantage over full random number is that the UUIDs are timestamp-based and thus sort-able.

Nobody said they're not. Every number is.

Please, see that thread: http://synopse.info/forum/viewtopic.php?id=2160 and particularly that post: http://synopse.info/forum/viewtopic.php … 529#p13529

The main property of each random number generator is to have uniform distribution. In fact, I am not sure what will be the properties when you call Randomize/Random each time in pairs. May be you will get a dependence between the left 42-bits and right 22-bits because probably it gets the clock as source of entropy.

edwinsn · 2015-04-19 03:34:23

alpinistbg wrote:

One advantage over full random number is that the UUIDs are timestamp-based and thus sort-able.
Nobody said they're not. Every number is.
Please, see that thread: http://synopse.info/forum/viewtopic.php?id=2160 and particularly that post: http://synopse.info/forum/viewtopic.php … 529#p13529
The main property of each random number generator is to have uniform distribution. In fact, I am not sure what will be the properties when you call Randomize/Random each time in pairs. May be you will get a dependence between the left 42-bits and right 22-bits because probably it gets the clock as source of entropy.

I must say I am really not good at this kind of thing, what I actually mean is that as opposed using pure random numbers for PKs, the timestamp+random PKs increase over time - in other worlds, a new record in most cases will have a greater ID value than the old ones.

alpinistbg · 2015-04-19 13:10:50

edwinsn wrote:

I must say I am really not good at this kind of thing, what I actually mean is that as opposed using pure random numbers for PKs, the timestamp+random PKs increase over time - in other worlds, a new record in most cases will have a greater ID value than the old ones.

Why not use Random just once for generating the NodeID instead of requiring the user to enter it (see earlier Arnaud proposal for ID's)? The chance of collision will be rather small.

I personally prefer to use much simpler scheme: per node increasing ID concatenated with the nodeID as insurance for uniqueness.

@Arnaud,
IMHO using time stamps can be quite tricky because of time zones and DST. AFAIK, Windows does not keep the system clock in UTC but makes a conversions, vice versa in Linux. Also it is prone to (potentially incorrect) adjustments from users or automatic DST. Not to even mention the leap seconds adjustment.

edwinsn · 2015-04-19 14:35:36

Thanks alpinistbg, per-node increasing Ids for all tables sounds like a simple good idea

ab · 2015-04-19 16:25:27

@alpinistbg
In mORMot clients, you have a safe way of retrieving the server date time at connection.
See the ServerTimeStampSynchronize method.
We use the UTC dedicated API under Windows, so that we should not be affected by DST changes.
Every timestamp is written in UTC within mORMot (e.g. autofilled TModTime and TCreateTime published fields).
And a small time shift should not be a problem, IMHO.

@edwinsn
Yes, but in this case, your IDs won't be monotonic according to the time.
If your data is to be gathered in a global list, all your content will be distributed according to the initial nodes distribution, not following the time line.

In short, it is a matter of taste, and business logic.
I would never use random IDs, but per-node or per-node + UTC timing IDs mqy be just fine.

alpinistbg · 2015-04-19 20:12:37

ab wrote:

@alpinistbg
In mORMot clients, you have a safe way of retrieving the server date time at connection.
See the ServerTimeStampSynchronize method.
We use the UTC dedicated API under Windows, so that we should not be affected by DST changes.
Every timestamp is written in UTC within mORMot (e.g. autofilled TModTime and TCreateTime published fields).
And a small time shift should not be a problem, IMHO.

As long as for the uniqueness - yes, it should be OK.

But if we consider the monotonic property it may be not the case.
We're talking for disconnected nodes here, i.e. there is no server all of the time, imagine the system clock adjusted backwards for example. This will immediately result in smaller ID's. The adjustment may come even from the system itself when configured to "Intenet time" on the first synchronization with the time server. Also the hardware clock drifts - the user may have a good reason for the adjustment ... Also there is a cases when somebody wants to cheat and adjusts the time to register some action earlier than it happened ...

For the present implementation: GetServerTimeStamp->NowUTC->GetSystemTime, there is no guarantee that it will be ever-increasing. Even the client is perfectly synchronized with the server, the discontinuity can happen on the server itself.

IMHO the only time-related function suitable is, perhaps, the GetTickCount since it is a 'count' and not 'time', but it is pointless because we can just get series of integers, N, as long we're not interested in the duration between each two instants.

ab · 2017-01-22 11:44:47

Of course, if you put the server timestamp back in time, you may have some time shifts, but since each node uses a counter, and refers to the server timestamp received at connection, each node will continue to generate unique IDs.

Exact monotony is not assured between nodes of course, but what we need is "almost" monotony, so that the items inserted in a BTREE index will be grouped together.
The timestamp stored in the ID is known to be an indication of time, not exact atomic time.
This was the main idea behind our TSynUniqueIdentifierGenerator class: creates something similar to https://docs.mongodb.com/manual/referen … /ObjectId/, but with 8-byte storage, not 12-byte storage, so that we may use it as unsigned 64-bit IDs, as expected by our framework.

mORMot Open Source

#1 2015-04-18 05:32:50

Time-sortable, 64bit UUID generation, 10 millions without collision

#2 2015-04-18 08:06:18

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#3 2015-04-18 10:30:11

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#4 2015-04-18 16:01:02

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#5 2015-04-19 03:34:23

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#6 2015-04-19 13:10:50

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#7 2015-04-19 14:35:36

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#8 2015-04-19 16:25:27

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#9 2015-04-19 20:12:37

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

#10 2017-01-22 11:44:47

Re: Time-sortable, 64bit UUID generation, 10 millions without collision

Board footer