Data replication in a multiple master environment

alpinistbg · 2014-11-12 13:36:35

Hi there!

First to say that I'm reading the SAD for a couple of days and I'm really impressed by the framework, praises!
Coming from the C++/RDBMS world, I have to go through with the painful paradigm shift (and the induced impedance )!

We have here a system which I have intention to redesign, It has saturated to a margin that it is hardly maintainable, it is RAD, it is rigid, it is whatever cited as 'bad' of the traditional Delphi/CBuilder design. But it have a multi-master replication, one aspect which eludes me into the SAD. This asset of the existing system is quite useful and valuable.

Anyway, here is my observations, they're about the surrogate keys used in the mORMot:

mORMot object ID's are Int32, they are actually used as PK into the underlying SQLite tables, actually they are unique just for the given table/class.
I perfectly understand why object ID's are Int32, that way they're the same size as pointers and some can painless do MyObj := TMyObj(MyDataBase.Add(One,True)) and vice-versa.
Such a surrogates introduced a locality, which is not easy to overcome when dealing with multiple databases.
Most persistence papers introduced generation of surrogate keys according to some scheme for global uniqueness. mORMot uses auto-id's (correct me if I'm wrong).
Anyway, the TSQLRecord published properties may contain different values written in the same memory cell (see 2). They can be integers or references and it is up to the programmer to handle it in a proper way:

From the SAD wrote:

TSQLRecord published properties do not contain an instance of the TSQLRecord class. They will instead contain pointer(RowID), and will be stored as an INTEGER in the database. So the main rule is to never use directly such published properties, as if they were regular class instance: otherwise you'll have an unexpected access violation error.

and later:

When accessing the detail objects, you should not access directly to FirstOne or SecondOne properties (there are not class instances, but integer IDs), then use instead the TSQLRecord. Create(aClient: TSQLRest; aPublishedRecord: TSQLRecord: ForUpdate: boolean=false) overloaded constructor, as such:

And finally, my real question is:

Isn't it worth to introduce a special class for the object ID? Or to encapsulate "smart-pointer" like semantics into the TSQLRecord? I think this will be entirely in the spirit of the OOP and will completely overcome the Int32 limitation.

Best regards,
Yuliyan

ab · 2014-11-12 13:57:05

Thanks for your positive feedback!

Some preliminary remarks:
1. Sometimes, the TSQLRecord published properties MAY be an actual instance, e.g. for TSQLRecordMany instances, or when you use Create*Joined() constructors... BTW are not those Create*Joined() constructors what you expect?
2. You can use a RecordRef kind of property, if you want to store an ID and its related table class.
3. The int32 limitation is not only due to this type cast to a pointer, but due to the design of SQLIte3 virtual tables (which we use in our ORM core), which relies on a RowID/ID integer primary key.

About your question, I do not know exactly how to implement it, nor what you expect.
So your proposal would like to use a generic-based encapsulation for such published fields, right?

Something like:

TSQLOrder = class(TSQLRecord)
...
published
  property Client: TLazy<TSQLClient> read fClient write fClient;

So that TLazy<TSQLClient> would allow lazy instantiating, and clean access to the ID or a real instance?

We may also introduce a new TSQLRecordAuto sub class, which would instantiate all TSQLRecord published properties, except the one with TLazy<T> type definition.

alpinistbg · 2014-11-12 14:31:14

Thank you for the quick reply!
And my apologies for being not so familiar with the details of SQLite3, mORMot, Object Pascal and such technicalities... My suggestion was is it possible to encapsulate something different than Int32 into, and use it as a GUID (in the general sense, not MS).
May be the Lazy<> is what I have in mind with my limited knowledge about the Delphi generics (I'm from C++ world), and if it offers enough functionality to replace the RowID's in a transparent way - then it is okay.

Just to note a few more things about the system I'm going to redesign - the necessity for 24/7 operation forced us to implement a multi-master replication scheme, it is modeled in a similar principles as the MS proprietary one. In the time of development we faced the problem of the PK global uniqueness. The original (MS) GUID solution just sucks (that is another topic).

Nevertheless, having more than one DB (for writing) imposes the usage of surrogate generation scheme which guarantees the PK's are globally unique. The mORMot OID's are not, and I am just wondering how can I implement such a functionality. May be there is another way to achieve this and I'm simply too ignorant to know it.

ab · 2014-11-12 15:24:54

If your point was about replacing ID/RowID from integer to another kind (e.g. TGUID or string), it won't be possible directly, due to the design of SQLIte3 virtual tables (which we use in our ORM core).
We may change the type, but we would have to maintain a separate lookup table... since we need this Integer field within the ORM core.

I'm not very familiar with the MS GUID solution, but I know the MongoDB ObjectID paradigm, as we implemented direct access to a MongoDB server in our SynMongoDB.pas unit.
I suspect both are quite similar. See http://docs.mongodb.org/manual/reference/object-id/
What was wrong with the MS solution? (any feedback is welcome)

We have a feature request for automatic data replication.
See http://synopse.info/fossil/tktview?name=3453f314d9
Our framework is REST-based, so there is a single point of access which is the resource-based scheme of REST.
See for instance how you can access a table hosted in another REST server, in one method call: http://synopse.info/files/html/Synopse% … ml#TITL_93

Multi-master replication is one step beyond that. But if each master has its own table, you could manage to make it work via some by-table redirection, and a local cache. A RecordRef may store a reference to a record in several tables of the model, in one integer. Then, tuned cache at each local REST node may help reducing the load on other nodes. In short, the record would belong to the table on which it has been created, and may be modified from other nodes.
Nevertheless, I suspect that your implementation of multi-master replication may help us find out some implementation patterns, to implement a true replication system.

Another possibility may be to rely on the replication abilities at storage level.
I mean, relying on MongoDB replication, then letting the mORMot server of each node just access its local MongoDB instance. See http://docs.mongodb.org/manual/replication/
Or use PostgreSQL replication, which is known to be efficient - but perhaps more complex to administrate. See https://wiki.postgresql.org/wiki/Replic … on_Pooling
The function TSQLRestStorageMongoDB.EngineNextID should be overridden to compute an unique per-node ID, either by adding a small bit shift for each node, or by pre-allocating ID ranges for each node.

Some additional questions:
- How many nodes do you have?
- How many replicated data do you expect to store?

alpinistbg · 2014-11-12 16:52:34

It will take me some time to browse the links, but I'll try to give some quick explanation. I'll presume that the requirement of global unique identifiers is beyond doubt when it comes to multi-master replication (right?).

ab wrote:

What was wrong with the MS solution? (any feedback is welcome)

All DB engines use some data structures at the file level. MSSQL, Firebird, SQLite stores the tables in B-trees. MSSQL calls it "Clustered index". When using GUID as a PK, the B-tree pages are organised according the GUID values which are not monotonic. Usually, most of the tables follow some temporal dependence. Log tables are the most obvious example - they are always appended at the end (with regard of time). Besides that, usually the log tables doesn't have other PK than surrogates. It finally happens that your biggest table have the worst structure which dramatically reduces the data throughput. Even more annoying is that the log tables are usually queried for a specified period of time - using secondary index actually makes the server to engage most of the file pages in memory, because the rows were physically spread across much more pages than they need to. That turns a simple report into a nightmare. They later introduced a monotonic NEWSEQUENTIALID() to overcome this deficiency. See Good Page Splits and Sequential GUID Key Generation.

The lack of the monotonic property is the main trouble with the GUIDs.

2) GUID clustered index keys cause new rows to be uniformly distributed across the clustered index, causing expensive page splits, poor cache performance and about 30% unused space on every database page.

Another problem

1) They are large (16 bytes), and enlarge all non-clustered indexes.

Further

ab wrote:

Another possibility may be to rely on the replication abilities at storage level.

At this point I'm concerned about the mORMot multiple caches and the performance penalties imposed invalidating them. It will be best if the replication system is coherent with the caches.
PostgreSQL replication was something in incubation at the time we started the implementation, I'll return back to reading to see the current status.

ab wrote:

The function TSQLRestStorageMongoDB.EngineNextID should be overridden to compute an unique per-node ID, either by adding a small bit shift for each node, or by pre-allocating ID ranges for each node.

I'll investigate this suggestion, but not quite sure that Int32 will suffice for my needs. If we assume that nodes will be numbered 0-127, that makes 7 bits and leaves just 24 (signed) for the increasing part. 16M rows can be easily achieved taking in mind that the generator is the same for all tables.

ab wrote:

Some additional questions:
- How many nodes do you have?
- How many replicated data do you expect to store?

One working implementation is with 2 levels of tree-like filtered replication; one central node, 5 middle-level nodes and 12 leaf nodes, total of 18. The central node collects the events from all leaf nodes and there the number of rows goes critical. Of course all that is subject to revision/redesign.

Regards,

ab · 2014-11-12 18:17:39

Thanks for the input.

So your design is not a "pure pairs" design, but a more classic hierarchical design, with 3 levels.

What about the amount of data?
Total data, in KB or rows? Temporary data on each node during off line mode?
Could you not split the data into smaller table, reducing the scope of each data? (i.e. implementing some "ownership" of data, making some part of the data R/W for one node, then R/O for other nodes).
What is your expected delay for synchronization around nodes, for the data to be eventually consistent?
Do you need ACID behavior, if yes, at which level?
How does it resolve the conflicts? Do you use a version number for the system?

alpinistbg · 2014-11-12 18:39:15

Actually, I can supply you with all the details about our solution, just feeling uneasy to bother someone with all particularities. Also still I'm not sure how that can apply to mORMot and will it be of real value for the project. But if there is a real interest of that aspect (i.e. multi-master replicas), I would like to know

ab · 2014-11-12 18:44:29

In fact, it may be very valuable.
An existing solution is a very good ground to build a design.
You already know the strengths and weaknesses of your existing solution.

For mORMot, I would like to implement, with a simple design, most replication schemes, up to what you call multi-master replicas (which is just some kind of peer replication).
We have already in the framework a lot of components to implement it.
We implemented history tracking in a few code lines, some weeks ago - see http://synopse.info/files/html/Synopse% … ml#TITL_85
I would like to implement some simple automatic replication process among mORMot servers.
This is expected by a lot of our users, and would make another unique feature to the framework - even in comparison with Java or C# systems.
This is why your feedback is very valuable.

And if you are willing to test our solution, and help it scale and debug, everyone would benefit from it: you would rely on a lot of existing bricks of our framework, and we would benefit from real feedback.

ttomas · 2014-11-12 23:37:30

PK ID is big restriction to use mORMot to migrate my systems.
1. My POS system Ex. 1 Central office and 20 POS locations (1 central db and 20 remote db), every POS location have 1-20 clients (sales point). POS locations must work even if internet connection VPN is lost. My PK is ID_LOCATION, ID
2. My accounting system, 1 Accounting office have >100 customers, some customers replicate data to accounting office. My PK is ID_USER, ID
ID already contain fiscal year bit shift ex. 140000001 for 2014
ERP system in development SaaS is also planed with PK ID_USER, ID. Several customers will use same database as storage. Some performance test for 1000 customers confirm that it is best to have 10 database with 100 customers per db instead of 1000 db-s.
Db-s Firebird and PostgreSQL.

alpinistbg · 2014-11-12 23:48:25

OK, got it!
Long time ago, when I was a student, someone said "The one who undertakes a merge replication is just looking for trouble...". So we are

ab wrote:

What about the amount of data?
Total data, in KB or rows? Temporary data on each node during off line mode?
Could you not split the data into smaller table, reducing the scope of each data? (i.e. implementing some "ownership" of data, making some part of the data R/W for one node, then R/O for other nodes).
What is your expected delay for synchronization around nodes, for the data to be eventually consistent?
Do you need ACID behavior, if yes, at which level?
How does it resolve the conflicts? Do you use a version number for the system?

Approx. numbers as I recall: total number of rows: 6M+, synchronization performed on 3-8 min. with mutual prime number of minutes for different nodes, about 100-150 rows max on a single synchronization, data ownership is implicitly defined by the application itself (three levels are actually different in their functionality). ACID is not expected at 100% but every violation must have a clear manifestation. Conflicts are resolved by preselection who will be the winner from the two participants.

For the feature request: http://synopse.info/fossil/tktview?name=3453f314d9

I think the simplest way to implement such a master/slave synchronization is to include a table-wide version number column, VN, no need for TModTime,TCreateTime. Every insert initializes the column in the inserted row with the max(VN)+1, every update also updates it into the updated row to max(VN)+1. The rows for the next synchronization then can be simply selected from the master as those with VN greater than the max(VN) from the slave table. The trouble is the deletion, but it can be either resolved by marking rows as deleted rather than actually deleting'em (and setting VN to max(VN)+1) or making a side "tombstone" table with the PK's of the deleted rows. It is not a big trouble, as long a surrogate (not a natural PK) is used which is guaranteed to be "stable".

The peer replication case:

It is crucial to have a way to uniquely address a row across multiple databases, so a unique value (among all nodes) must be assigned for each inserted row. It may not be the PK, but if the PK is auto-increment value, it may happen that the same value is already used in the other DB and that will lead to inevitable conflict. So, it will be best if we can use that "address" column as a PK also. Furthermore, for the reasons mentioned before, it will be also good for the generated unique values to have additional properties, e.g. monotony.

In a true RDBMS case, each CRUD operation is registered through a corresponding trigger. The registration is made in a side table, lets call it "journal" table. Each row is consisting of: The name (or some id) of the table, the type of the operation (Insert, Update, Delete), the unique "address" of the affected row, and finally the originating node id of the operation. The last thing is for distinguishing the local changes from external ones.

The synchronization goes as follows: for each synchronized table (the order is important, more on that later) the greatest "unique address" of external change is taken and rows with the greater value extracted from the other side journal. They should describe the latest changes from the other side. These rows left-joined with the corresponding table gives the complete data set to be INSERT OR UPDATEd, deletes have NULLs in their data columns. Of course this is the case when the unique "addresses" are ordered in the time domain, otherwise the comparison must be made on an additional timestamp field. The same thing must be repeated backwards to complete the table synchronization. It is also important to transfer the foreign journal entries into the local journal to mark the changes as "already applied".

Some difficulties exists, one of them is the FK enforced DRI. By the way, during the replication MSSQL disengages most of the constraints and triggers with NOT FOR REPLICATION directive. Our solution is to synchronize the tables in strict topological order, with master tables first, then the slave tables. This will work when no loops exist into the FK constraints, also will work for a simple loop into the same table since the modifications are timely ordered. Will not work when longer loops exist, but I haven't seen so far such an "exotic" DB schema. There is a tool developed which extracts the dependencies between the tables and does the topological sorting.

Other difficulty is the initial snapshot of already operational DB, the simplest way to do this is to perform a backup and then restore it on the other node. Alternative approach is to define additional operation in addition to the I,U,D, e.g. Register and to perform that operation on all rows in the entire database. On the other side all such operations will be treated as INSERT OR UPDATE and thus (taking much mooore time) the entire content will be transferred. Of course, on an unprepared DB the row insertion order is unknown and the circular dependencies can be a real problem. The advantage of the latter approach is that the source DB doesn't go offline.

But after such a prolonged writing, the introduction of the "shared nothing" principle discards all the concerns mentioned above

Regards,

moctes · 2014-11-13 04:30:44

About sortable unique ID's there is this blog post from Rob Conery which shows a way to solve it, although is implemented as a PostgreSQL function and DB Schemas maybe you could borrow some concepts ?

http://rob.conery.io/2014/05/29/a-bette … ostgresql/

Regards

ab · 2014-11-13 16:05:32

ttomas wrote:

PK ID is big restriction to use mORMot to migrate my systems.

IMHO you should use the TSQLRecord.ID as an increasing integer, and not put the fiscal year in it.
The displayed ID may be diverse than the TSQLRecord.ID value, with whatever format you need.

Then I guess that your data should have much benefit of being sharded, i.e. un-normalized.
Since once one order has been emitted, it will be stable, I suspect you may put one POS process in a single TSQLRecord content, using high level structures like TDocVariant, dynamic arrays or records.

I suspect you are still too much thinking in terms of RDBMS.
If you use mORMot with one internal SQLite3 backend, some performance restrictions would also in practice disappear.

ab · 2014-11-13 16:09:07

moctes wrote:

About sortable unique ID's there is this blog post from Rob Conery which shows a way to solve it, although is implemented as a PostgreSQL function and DB Schemas maybe you could borrow some concepts ?

This is in fact pretty close to what MongoDB expect.
See for instance how we generate an ID on client side in our SynMongoDB.pas unit, in TBSONObjectID.ComputeNew.

alpinistbg · 2014-11-13 16:44:26

ab wrote:

moctes wrote:
About sortable unique ID's there is this blog post from Rob Conery which shows a way to solve it, although is implemented as a PostgreSQL function and DB Schemas maybe you could borrow some concepts ?
This is in fact pretty close to what MongoDB expect.
See for instance how we generate an ID on client side in our SynMongoDB.pas unit, in TBSONObjectID.ComputeNew.

In fact, there are different schemes for generating ID's, Hi-Lo, LB Alloc, etc. As long as they are globally unique, they should do the work. Even the UUID/GUID's!

ab wrote:

If your point was about replacing ID/RowID from integer to another kind (e.g. TGUID or string), it won't be possible directly, due to the design of SQLIte3 virtual tables (which we use in our ORM core).

@ab I am browsing through the sources, but I can't find where the that dependency exists (besides pointer/Integer thing). As long as I can see, the RowID's in SQLite3 are Int64. Even the https://www.sqlite.org/c3ref/module.html defines the pRowid in (*xRowid) is an int64. Of course, the source is quite big, but I'll appreciate if you can pinpoint me some line where this can be seen.

Regards,

Last edited by alpinistbg (2014-11-13 16:46:43)

ab · 2014-11-13 16:59:39

alpinistbg wrote:

OK, got it!

To sum-up:
1. Eventually consistent, with conflict notification.
2. Table versioning.
3. Unique (monotonic) ID among nodes per table.
4. Peer replication via a CUD journal table, left-joined on id (+timestamp).
5. PK are preserved when processed in topological (master first) order.
6. Initial snapshot via backup or dedicated entry in the journal.

I would like to use a similar pattern, but with some differences and optimization.
Of course, we would rely on a as-unnormalized-as-possible data in mORMot, to ease the replication.
The main point would probably be the journal. We would use a dedicated database for this (which may be hosted on a slower drive), and a dedicated TSQLRecordSynch (similar to TSQLRecordHistory).

Eventually we would have to enhance the framework IDs to be Int64 everywhere, but for explicit TSQLRecord published properties.

type
  TID = Int64;

So, instead of

published Client: TSQLRecordClient read fClient write fClient;

we may define, using a new TID: Int64 type defined in mORMot.pas to be used everywhere instead of integer:
1. if the column name match the table foreign name: nothing to do

published Client: TID read fClient write fClient;
// here the "Client" column will be tied to the "Client" table, e.g. TSQLRecordClient

2. if the column name does not match the table foreign name
2.1. before Delphi 2010: we specify it as a specific type so that the table would be retrieved from the type name (TSQLRecordClientID -> TSQLRecordClient)

type TSQLRecordClientID = type(TID);
...
published OrderedBy: TSQLRecordClientID read fOrderedBy write fOrderedBy;

2.2. before Delphi 2010: OR we specify the table name within the property name

published OrderedBy_Client: TID read fOrderedBy write fOrderedBy;

2.3. since Delphi 2010: we specify it via an attribute

[Join(TSQLRecordClient)]
published OrderedBy: TID read fOrderedBy write fOrderedBy;

alpinistbg · 2014-11-13 18:59:39

Two doubts:

I don't think the timestamps are good, imagine nodes in different timezones, DST transitions, somebody changed the date, etc. Better is to rely on the monotony of the ID generator;
Into the RDB design and when normalizing, the number of tables tends to increase. This makes the change "impact" area smaller and gives less probability for conflicts. When de-normalizing, the entire aggregate will be contained in a single row and a slightest change will induce conflict; Imagine a proforma invoice with two rows, in a classic RDB the rows will be in a secondary table, when different rows are modified on different nodes then everythng will be ok, but when denormalized, the rows will be contained into the proforma and modifying any of the rows will mark the entire document as modified;

Personally, I don't like the attributes at all. I have a long experience with the .NET Frameworks (by necessity) and my opinion is that when somebody have no idea how to express something then he "invents" the attributes.

Last edited by alpinistbg (2014-11-13 19:07:27)

ab · 2014-11-13 22:20:42

That's it...
I've made another MAJOR change of the framework.
Now all ORM IDs are now declared as TID (=Int64) instead of integer - also added a new TIDDynArray type to be used e.g. for BatchSend, and declared the TRecordReference type as Int64 - whole API is impacted.
See http://synopse.info/fossil/info/2b677bd490
You pushed so much... that I did it!

We will see what we do for TSQLRecord published properties.
It will work directly, as soon as the IDs are still encoded in 32 bit (or if you are under Win64).
But in fact, we could simply define the fields which may be larger than 32 bit as TID, then let the application code create the foreign keys, when creating the TSQLModel.

@alpnistbg
1. Yes, monotonic version numbers are better. I will go in this direction.
2. I'm not convinced that denormalizing would increase the change impact, with the implementation scheme I would like to do. On the contrary, the whole aggregate would be replaced. But you can still normalize the data which need to be modified in an ACID way.
3. I do not like attributes either... I prefer that the data is tuned when defining the TSQLModel, leaving the TSQLRecord alone from any persistence detail.

mingda · 2014-11-14 07:16:51

another function need consider, the TSQLRecord current don't have old value, for e.g. a object

TTest = class(TSQLRecord)
published
property column1 rawutf8;
property column2 rawutf8;
property column3 rawutf8;
...
end;

when two user read a object at same time, get the same object {"ID":1,"column1":"value","column2":"value","column3":"value"},
if userA update column1 to "value_updateA",
userB update column2 to "value_updateB",
current if userA update first, userB next, then the finally object value will be {"ID":1,"column1":"value","column2":"value_updateB","column3":"value"}, userA's change is lost,

if TSQLRecord have old value, use the old value generate upate sql,
userA update will generate
update Test set column1 = 'value_updateA' where id = 1 and column1 = 'value',
userB update will generate
update Test set column2 = 'value_updateB' where id = 1 and column2 = 'value',
this solve two problem, 1) update override other column's value 2) multi user update conflict.

for replication, if we have such journal,
1). {"ID":1,"column1":"value","column2":"value","column3":"value"}
2). {"ID":1,"column1":"value_updateA","column2":"value","column3":"value"}
3). {"ID":1,"column1":"value","column2":"value_updateB","column3":"value"}
after the last change merge, we only get the object value:
3). {"ID":1,"column1":"value","column2":"value_updateB","column3":"value"}, this is not expect.

if we have old value, such journal will look like such:
1). {"ID":1,"column1":"value","column2":"value","column3":"value"}
2). {"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column1":"value_updateA","column2":"value","column3":"value"}}}
3).{"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column1":"value","column2":"value_updateB","column3":"value"}}}

or we can have the not update empty:

1). {"ID":1,"column1":"value","column2":"value","column3":"value"}
2). {"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column1":"value_updateA"}}}
3).{"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column2":"value_updateB"}}}

then we can merge it to
{"ID":1,"column1":"value_updateA","column2":"value_updateB","column3":"value"}

this will greatly reduce conflict, for userA, userB update same column, such change later user
update will get conflict hit, since the update will failed.

just my personal thought, thanks!

alpinistbg · 2014-11-14 08:39:38

@mingda
Thanks for sharing your thoughts! It is worth to note that the merge replication is a vast topic for discussion and it is a matter of benefits and trade-offs what decisions the developer will made in each concrete situation.
Your considerations are about the "granularity" of the journal, the approaches may be for "row" or "column" values. In my previous post I expressed some doubt about the aggregates and that is about the same thing. I'll try to explain:

mingda wrote:

if TSQLRecord have old value, use the old value generate upate sql,
userA update will generate
update Test set column1 = 'value_updateA' where id = 1 and column1 = 'value',
userB update will generate
update Test set column2 = 'value_updateB' where id = 1 and column2 = 'value',
this solve two problem, 1) update override other column's value 2) multi user update conflict.

Irrelevant, because each user works in it own DB, and what you describe is more subject to transaction isolation (if you suspect something will change between select/update) than the SQL logic. Furthermore, you can't rely on other fields to select because they are not immutable as OID's are.

mingda wrote:

if we have old value, such journal will look like such:
1). {"ID":1,"column1":"value","column2":"value","column3":"value"}
2). {"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column1":"value_updateA","column2":"value","column3":"value"}}}
3).{"old record":{"ID":1,"column1":"value","column2":"value","column3":"value"},
"new record":{{"ID":1,"column1":"value","column2":"value_updateB","column3":"value"}}}

Here you want to introduce the "column" value granularity in a some redundant way, generally, the finer granularity is, the less probability for conflict exists. But it is a bit more complicated for the implementation. May consider the YAGNI principle ...

In our legacy system, there are inherent ownership of the different tables, so we decided that the row granularity is perfectly enough for the purpose.

Another decision that we made was that the conflict resolution will be in favor of the node which started the synchronization (assuming it is on a higher administrative level), i.e. the synchronization direction defines the conflict winner.

According to many papers on the Web, there is no such thing as "Conflict-free replication" in general, so the conflicts are inevitable and we must find a way to live with them. It is a matter of clearly defined conflict resolution discipline.

Regards,

alpinistbg · 2014-11-14 10:17:11

@ab
I was engaged several months in implementing the replica in T-SQL+FPC. Nevertheless, I want to note why mORMot way should be easier. BTW, much or less as you said:

ab wrote:

We have already in the framework a lot of components to implement it.
We implemented history tracking in a few code lines, some weeks ago - see http://synopse.info/files/html/Synopse% … ml#TITL_85

The REST transport layer is just perfect for the purpose. Previously restricted at TDS with no options;
The tracking can be implemented in Pascal, no need for T-SQL/P-SQL/x-SQL triggers and hacks for detecting replication sessions or filling default values on OID columns;
Engaging the change-sets into the memory/cache can greatly reduce the row/table locks and thus improve performance;
It will be very easy to find the table dependencies via RTTI and make the right master-detail tables sorting instantly. Now the order is hard-coded and it is computed (tool) from the DB metadata during the planning stage;
The OID's can be computed without a DB round-trip for each row.

Regarding the 4).

ab wrote:

2.1. before Delphi 2010: we specify it as a specific type so that the table would be retrieved from the type name (TSQLRecordClientID -> TSQLRecordClient)
type TSQLRecordClientID = type(TID);
...
published OrderedBy: TSQLRecordClientID read fOrderedBy write fOrderedBy;

I think that should be the natural way.

Regarding the 5).

Consider my T-SQL OID generator:

CREATE procedure dbo.dbclone_genkey
  @out_key bigint output,
  @in_increment bigint = 1
as
begin
  declare @sid bigint, @key bigint
  
  set transaction isolation level repeatable read
  set xact_abort on
  begin transaction
  
  select top 1 @sid = dbclone_id, @key = dbclone_genval 
  from dbclone_siteid order by dbclone_ord
  
  update dbclone_siteid
  set dbclone_genval = dbclone_genval + @in_increment
  where dbclone_id = @sid
  
  commit transaction
  
  set @out_key = @key * 10000 + @sid;
end

Here we have @sid for a DB site identifier and @key for the next key to be returned, which must be modified and written back. This is our bottleneck. All CRU's must go through that procedure. The solution is to pre-allocate a range of ID's with the @in_increment parameter and to have just one round-trip for a block. This is not possible in pure T-SQL, but absolutely feasible at the mORMot OPF level.

I'll continue to feed you with the details, at least until you say it is enough .

And WOW! You're lightning fast with the changes!

EMartin · 2014-11-14 11:21:07

I believe that would be fantastic have a "native" replication with mORMot. We are using http://www.symmetricds.org, is developed in Java and run as external service that generate tables and triggers inside the database to be replicated (Firebird 2.5). It also handles file replication and the schema DDL replication. With DDL schema replication I mean that when a table is created, the DDL can be put in a queue for replication and the other extreme receive the schema DDL and create the table. I think that DDL replication it should be taken into account. The SymmetricDS replication put all data for replication tables inside of each database, may be mORMot can use the SQLite3 database in a similar way, this allow that if database service is restarted the replication can continue with pending data for replication (SymmetricDS do it).

Best regards.

alpinistbg · 2014-11-14 13:31:33

@EMartin
Thanks for the pointer, I wasn't aware of the product, it looks feature-rich ... and claims to have a lot of advanced functions.

I believe that mORMot have it's internal handling of the schema changes and may be the task of schema syncing is a lot simpler compared to a classic RDBMS. Mssr. Bouchez could express an opinion on that.

alpinistbg · 2014-11-14 14:59:11

More details on the journals...

A small excerpt from the journal:

key           siteid  tablename  recordkey     op stamp                   donorkey     donorsid
------------- ------- ---------- ------------- -- ----------------------- ------------ --------
52424390017   17      Subscript  52424380017   I  2014-01-27 10:32:33.187 52424390017  17
52424470017   17      Subscript  52424380017   U  2014-01-27 10:32:37.497 52424470017  17
52424490017   17      Subscript  52424480017   I  2014-01-27 10:32:43.420 52424490017  17
52424530017   17      Subscript  52424480017   U  2014-01-27 10:32:49.280 52424530017  17

Here key is the PK of the table, siteid is the identifier of the site where the CRUD operation occurred, tablename, recordkey are the row address where occured, stamp is the timestamp, donorkey is the key of the foreign journal table from where the change was imported, donorsid is the identifier of the site from where it was imported.

(each site has unique small number for site id, 'donor' is the foreign site)

On local updates donorkey equals key, donorsid equals siteid.

And the sync goes like this:

// check in our journal for last update from that site/table
SQL.Add( 'select coalesce(max(jlocal.donorkey),0)' );
SQL.Add( 'from journal jlocal ' );
SQL.Add( 'where donorsid = :dsid and tablename = ''' + tableName + '''' );
Params[0].Value := donorSiteID;
Open;
lastEntry := Fields[0].Value;

That is for retrieving the last update applied from the specified 'donor' for a given table.

And then:

SQL.Add( Format( 'select top %d', [recordsLimit]));
SQL.Add( '  j.key as "$$key", j.siteid as "$$siteid",');
SQL.Add( '  j.tablename as "$$table", j.recordkey as "$$rkey",');
SQL.Add( '  j.operation as "$$oper", j.stamp as "$$stamp",');
SQL.Add( '  j.donorkey as "$$donorkey", j.donorsid as "$$donorsid",');

// Here are the fields from the data row with the original field names
SQL.Add( '  l.*');

SQL.Add( 'from journal j left join "' + tableName + '" l on');
SQL.Add( '  j.recordkey = l.key');
SQL.Add( 'where');

SQL.Add( '  j.key > :lastentry and' ); // Newer journal entries
SQL.Add( '  j.donorsid <> :lsid and' ); // Not originating from here!
SQL.Add( '  j.tablename=''' + tableName + ''' and' ); // For requested table
SQL.Add( '  (j.operation = ''D'' or l.key is not null)');

// The keys are increasing so the order will be chronological
SQL.Add( 'order by j.key');

Params[0].Value := lastEntry;
Params[1].Value := localSiteID;

Open;

Now we have a dataset with the rows inserted, updated or deleted in the foreign table since the last sync. The reason to include 'D' operations is to process them in chronological order.

      while not EOF and (recordsLimit > 0) do
       begin
           case Operation of
             'I': InsertRec;
             'U': UpdateRec;
             'D': DeleteRec;
             'R': CheckInRec; // Insert or update?
           else
             raise EDatabaseError.Create('Unknown operation!');
           end;
         ...
         // Copy the foreign journal entry just processed into the local journal
         InsertJournalEntry; 
         ...

The InsertJournalEntry goes like this:

procedure InsertJournalEntry;
  begin
    with qryCloning do
    begin
      SQL.Text := 'exec dbclone_insert_in_journal :key, :siteid, :table, :rkey, :oper, :stamp, :donorsid';
      Params[0].Value := qryDonor.FieldByName('$$key');      // This value will be written in the donorkey column!
      Params[1].Value := qryDonor.FieldByName('$$siteid');
      Params[2].Value := qryDonor.FieldByName('$$table');
      Params[3].Value := qryDonor.FieldByName('$$rkey');
      Params[4].Value := qryDonor.FieldByName('$$oper');
      Params[5].Value := qryDonor.FieldByName('$$stamp');
      Params[6].Value := donorSiteID;
      ExecSQL;
    end;
  end;

The only trick is the usage of donorkey and donorsid values. Used that way there is no need for a separate table for keeping information about each remote site and the corresponding sync progress. Also, it keeps track for the changes propagation among the databases.

Hope no one got bored!

Regards,

ab · 2014-11-14 15:26:48

I'm not bored, on the contrary!

How do you make the purge of the journal?

If I understand correctly, any issue due to potential clock de-synchronization is fixed by the order on which you apply the journal events?

Since the site id is part of the record id, why do you maintain separate donorid/siteid columns? Only for performance reasons?

alpinistbg · 2014-11-14 17:08:23

ab wrote:

How do you make the purge of the journal?

For the externally imported changes it suffice to leave just the rows with max(donorkey) for each <tablename, donorsid> pair. For the local changes purging must be done according to the most lagging external site, but again, one row must be left for each <tablename, donorsid=localsid>. Now, the assumption is that no one will lag more than 30 days, since no information is recorded about the lag.

ab wrote:

If I understand correctly, any issue due to potential clock de-synchronization is fixed by the order on which you apply the journal events?

It should be.

ab wrote:

Since the site id is part of the record id, why do you maintain separate donorid/siteid columns? Only for performance reasons?

That was in case I decide to change the scheme of generating keys (set @out_key = @key * 10000 + @sid); Besides, I was not quite sure how the arithmetic will impact on the index processing.

I wanted everything to work in the first version, and as we all know "Premature optimization is the root of all evil" . So I stuck just to the bare minimum. Never optimized it since then.

BTW when I got rid of the MS subscriptions/publications, I got so much free room that I decided even not to normalize the journal (tablename). Actually there is lot to be improved, e.g. the local row deletion makes obsolete all changes made before to the same row (but may impact DRI), etc. but, do I really need it?

ab · 2014-11-14 23:32:02

About how to associate a TID to a given table, for ON DELETE SET DEFAULT triggers, we just defined the following:

You can optionally specify the associated table, using a custom TID type for the published property definition. In this case, you would sub-class TID, using tableNameID as naming convention.
For instance, if you define:

type
TSQLRecordClientID = type TID;
TSQLOrder = class(TSQLRecord)
...
published Client: TID read fClient write fClient;
published OrderedBy: TSQLRecordClientID read fOrderedBy write fOrderedBy;
...

Those two published fields would be able to store a Int64 foreign key, and will both have one index created on the database. But their type (TID or TSQLRecordClientID) will make a difference about the deletion process.

By using the generic TID type, the first Client property won't have any reference to any table, so an index will be created for this column, but no deletion tracking would take place.

On the other hand, following the type naming convention, the other OrderedBy property will be associated with the TSQLRecordClient table of the data model. In fact, the ORM will retrieve the 'TSQLRecordClientID' type name, and search for a TSQLRecord class name matching TSQLRecordClassNameID, which is TSQLRecordClient in this case.
As a result, the ORM will create an index for the column, and track any TSQLRecordClient deletion: it will ensure that this OrderedBy property will be reset to 0 for any row pointing to the deleted record. As with TSQLRecord or TRecordReference properties, the ORM won't define a ON DELETE SET DEFAULT foreign key via SQL, but implement it at ORM level.

See http://synopse.info/fossil/info/7f710d10b0

alpinistbg · 2014-11-17 08:49:22

Newbies confused! ON DELETE SET DEFAULT is a FK constraint, not trigger. Was the mORMot uses RDBMS DRI peculiarities to support the persistence? Isn't it all about to go beyond the RDB patterns?

ab · 2014-11-17 09:22:24

Yes, a FK is defined as a constraint, but is implemented internally by the RDBMS with some triggers.

What do you mean exactly by "Was the mORMot uses RDBMS DRI peculiarities to support the persistence?"? (there should be a missing word)
Do you mean that mORMot is supposed to generate DRI SQL statements for each RDBMS it runs on?
Yes, it is.
Theoretically.

BUT it would have some drawbacks:

1. If we rely on the RDBMS triggers for DRI, some modifications (SET DEFAULT or CASCADE) won't be monitored by the ORM.
Some features (like change tracking audit trail, or our upcoming data replication) would not work any more.

2. And in mORMot, we may have tables in one database, with DRI with tables on other databases!
It is not possible to handle with regular DRI statements, which are tied to one DB instance (unless you install some complex transactional engines, which is a nightmare to manage from my experiment).

There are so many potential combinations and use cases, that mixing ORM with SQL was difficult to make.
So we decided, in the first implementation, to handle it at ORM pure level.
We tried to keep it simple, and furthermore testable.
But, of course, we are open to any improvement!
We could eventually let the DB handle DRI constraints, to increase performances, if both tables are on the same DB.
Trans-table ORM for external DBs is somewhat limited now: we rely on SQLite3 virtual tables e.g. for JOINed queries, which are working, but not perfect either.
If you have any idea, feel free to propose your change.

alpinistbg · 2014-11-17 10:14:18

What do you mean exactly by "Was the mORMot uses RDBMS DRI peculiarities to support the persistence?"? (there should be a missing word)

Quite possible! Sorry, English is not my native language.
What I meant was that, I'll be surprised to see mORMot trying to use other than just simple DML expressions by several reasons:

Multiple DB back-ends, each with its own language (and semantic) peculiarities;
Entanglement with a patterns considered as not suitable;
Introduction of a hidden behavior, which cannot be followed at a higher level;
Usage of NoSQL engines lacking most of the properties of the classic RDBMS;

My confusion is about implementing the same features (e.g. FK constraints) on a higher level which is IMHO following the same patterns.

alpinistbg · 2014-11-18 17:39:02

@ab,
Since I'm not sure how I can be helpful, I'll remain available if you have more questions. Meanwhile, I'll try to get more familiar with the framework. Thanks!

ab · 2014-11-18 17:58:23

Thanks to you for your feedback, interest, and very valuable input!

alpinistbg · 2014-11-18 23:22:26

I've found this function for generating the IDs (shouldn't it be Int64?):

function TSQLRestStorageExternal.EngineLockedNextID: Integer; {virtual;}

Please, notice that fEngineLockedLastID is assigned a value in 3 more places. In the following method it is assumed that the generated ID values will be adjacent, which may not be the case and strongly depends on the EngineLockedNextID implementation:

procedure TSQLRestStorageExternal.InternalBatchStop;
...
finally
    if (fBatchMethod=mPost) and (fBatchCount>1) then
      // -1 since fBatchFirstAddedID := EngineLockedNextID did already a +1
      inc(fEngineLockedLastID,fBatchCount-1);
...

Two questions:

Isn't it appropriate to introduce the method EngineLockedNextID earlier in TSQLRest in order to create a seam for a
customizable ID generator?
Can we have a dedicated interface (abstract class) for such a customizable generator?

Regards,

ab · 2014-11-18 23:51:29

Those IDs are here sequential values so would never reach 32 bits.
By design.
And in batch mode a range is allocated for best speed.

But yes we should refactor this part for proper data replication.

ab · 2015-02-04 10:52:24

The CRDT (Commutative Replicated Data) pattern is a pretty nice concept, which we may implement.
Here are some links:
http://pagesperso-systeme.lip6.fr/Marc. … R-6956.pdf
http://highscalability.com/blog/2010/12 … ntrol.html
https://aphyr.com/posts/294-call-me-maybe-cassandra/
https://github.com/aphyr/jepsen
http://en.wikipedia.org/wiki/Conflict-f … _data_type

In fact, the propagation/replication algorithm we detailed in this forum thread may be seen as CRDT.
By ensuring the order of modifications is preserved, we may have a consistent duplication.
Now I will let some time to think about it....

mORMot Open Source

#1 2014-11-12 13:36:35

Data replication in a multiple master environment

#2 2014-11-12 13:57:05

Re: Data replication in a multiple master environment

#3 2014-11-12 14:31:14

Re: Data replication in a multiple master environment

#4 2014-11-12 15:24:54

Re: Data replication in a multiple master environment

#5 2014-11-12 16:52:34

Re: Data replication in a multiple master environment

#6 2014-11-12 18:17:39

Re: Data replication in a multiple master environment

#7 2014-11-12 18:39:15

Re: Data replication in a multiple master environment

#8 2014-11-12 18:44:29

Re: Data replication in a multiple master environment

#9 2014-11-12 23:37:30

Re: Data replication in a multiple master environment

#10 2014-11-12 23:48:25

Re: Data replication in a multiple master environment

#11 2014-11-13 04:30:44

Re: Data replication in a multiple master environment

#12 2014-11-13 16:05:32

Re: Data replication in a multiple master environment

#13 2014-11-13 16:09:07

Re: Data replication in a multiple master environment

#14 2014-11-13 16:44:26

Re: Data replication in a multiple master environment

#15 2014-11-13 16:59:39

Re: Data replication in a multiple master environment

#16 2014-11-13 18:59:39

Re: Data replication in a multiple master environment

#17 2014-11-13 22:20:42

Re: Data replication in a multiple master environment

#18 2014-11-14 07:16:51

Re: Data replication in a multiple master environment

#19 2014-11-14 08:39:38

Re: Data replication in a multiple master environment

#20 2014-11-14 10:17:11

Re: Data replication in a multiple master environment

#21 2014-11-14 11:21:07

Re: Data replication in a multiple master environment

#22 2014-11-14 13:31:33

Re: Data replication in a multiple master environment

#23 2014-11-14 14:59:11

Re: Data replication in a multiple master environment

#24 2014-11-14 15:26:48

Re: Data replication in a multiple master environment

#25 2014-11-14 17:08:23

Re: Data replication in a multiple master environment

#26 2014-11-14 23:32:02

Re: Data replication in a multiple master environment

#27 2014-11-17 08:49:22

Re: Data replication in a multiple master environment

#28 2014-11-17 09:22:24

Re: Data replication in a multiple master environment

#29 2014-11-17 10:14:18

Re: Data replication in a multiple master environment

#30 2014-11-18 17:39:02

Re: Data replication in a multiple master environment

#31 2014-11-18 17:58:23

Re: Data replication in a multiple master environment

#32 2014-11-18 23:22:26

Re: Data replication in a multiple master environment

#33 2014-11-18 23:51:29

Re: Data replication in a multiple master environment

#34 2015-02-04 10:52:24

Re: Data replication in a multiple master environment

Board footer