About Big Data Support

wac1104 · 2017-08-30 08:43:12

Hi Ab

Is mORMot able to build a framework for supporting big data?
For example: Hadoop (HDFS + MapReduce, HBase).

ab · 2017-08-30 09:34:36

Of course.
But the best / most scalable architecture doesn't necessarily come from using a "big data" database like Hadoop as a centralized store. Whatever the marketing says, it is not magic.
We prefer to use a decentralized design, following the CQRS/MicroServices pattern: the data is stored in several places, each MicroService having its own storage, and all services talk to each other through publish/subscribe peer-to-peer SOA real-time notifications (using interfaces and WebSockets). In this architecture, the map/reduce pattern is implemented at SOA level, not at DB level: each node asks for the information it needs, and stores it in its own convenient way - sometimes even with no disk persistence at all, when in-memory is enough.
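
As a rough illustration of this pattern (plain Python, not mORMot code), the sketch below uses an in-process bus in place of the WebSockets notifications. The `EventBus`, `InventoryService` and `ReportingService` names, the `device.connected` topic and the event fields are all hypothetical. Each service subscribes to the same events and keeps only its own reduced view of them, in memory.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for publish/subscribe notifications."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Every subscriber receives the event and decides what to keep.
        for callback in self._subscribers[topic]:
            callback(event)


class InventoryService:
    """Keeps only the latest known firmware per device, in memory."""

    def __init__(self, bus):
        self.firmware = {}
        bus.subscribe("device.connected", self.on_connected)

    def on_connected(self, event):
        self.firmware[event["serial"]] = event["firmware"]


class ReportingService:
    """Keeps only a counter per firmware: its own 'reduce' of the same events."""

    def __init__(self, bus):
        self.connections_per_firmware = defaultdict(int)
        bus.subscribe("device.connected", self.on_connected)

    def on_connected(self, event):
        self.connections_per_firmware[event["firmware"]] += 1
```

Neither service queries a shared database: each one builds the storage that suits its own needs from the notifications it receives.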

For instance, we recently used mORMot for an IoT project involving big data.
Objects/devices are connected 24/7 to a set of server nodes, each with its own set of local SQLite3 databases (events, properties, content, API...).
Some meta-services are able to identify on which node a device is connected (using TSynBloomFilter).
Dedicated MicroServices have their own local storage, e.g. for inventory, data/media processing, or reporting, and here SQLite3 is amazing.
Then we have some settings to send any level of information to a cluster of MongoDB servers, for further analysis.

In fact, you can define as many separate SQLite3 databases as needed: a monolithic SQLite3 instance is not a good idea when data grows.
The idea is to redirect the TSQLRecord tables to some dedicated SQLite3 instances. And if a table is expected to receive a lot of incoming data which is written once and never altered (e.g. events), you could redirect it to TSQLRestStorageShard and let the framework generate several SQLite3 files of small size (e.g. 100MB), so that we can easily purge/download/archive the information - just at file level.
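
To make the sharding idea concrete, here is a minimal Python sketch of an append-only store split across several small SQLite files. It is not how TSQLRestStorageShard is implemented: the `ShardedEventStore` class, the file naming and the `rows_per_shard` limit (used here as a simple stand-in for a file size limit) are assumptions made for the example, and the ID counter is not restored after a restart.

```python
import os
import sqlite3


class ShardedEventStore:
    """Append-only event store split across several small SQLite files."""

    def __init__(self, folder, rows_per_shard):
        self.folder = folder
        self.rows_per_shard = rows_per_shard  # hypothetical stand-in for a size limit
        self.next_id = 1
        self._conn = None
        self._conn_index = None

    def shard_index(self, event_id):
        # IDs are sequential, so the shard is derived from the ID alone.
        return (event_id - 1) // self.rows_per_shard

    def shard_path(self, index):
        return os.path.join(self.folder, f"events_{index:04d}.db")

    def _writer(self, index):
        # Only the newest shard is kept open for writing.
        if self._conn_index != index:
            self.close()
            self._conn = sqlite3.connect(self.shard_path(index))
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS events "
                "(id INTEGER PRIMARY KEY, payload TEXT)"
            )
            self._conn_index = index
        return self._conn

    def append(self, payload):
        event_id = self.next_id
        conn = self._writer(self.shard_index(event_id))
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO events (id, payload) VALUES (?, ?)",
                (event_id, payload),
            )
        self.next_id += 1
        return event_id

    def get(self, event_id):
        path = self.shard_path(self.shard_index(event_id))
        if not os.path.exists(path):
            return None  # shard purged/archived, or never written
        conn = sqlite3.connect(path)
        try:
            row = conn.execute(
                "SELECT payload FROM events WHERE id = ?", (event_id,)
            ).fetchone()
        finally:
            conn.close()
        return row[0] if row else None

    def purge_shard(self, index):
        # Purging is a plain file deletion: no DELETE statement, no VACUUM.
        if self._conn_index == index:
            self.close()
        os.remove(self.shard_path(index))

    def close(self):
        if self._conn is not None:
            self._conn.close()
        self._conn = None
        self._conn_index = None
```

Because old shards are never written again, archiving or downloading one is just a file copy, and dropping old events does not touch the shard currently being written.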
When the data grows, we identified that a lot of content is redundant: you handle a lot of similar text. In this case, introducing TRawUTF8Interning greatly reduces the memory impact of the processed information.
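
The principle behind string interning can be sketched in a few lines of Python. This only illustrates the general technique, not the TRawUTF8Interning API: the `StringInterner` class and its methods are hypothetical names. Equal strings are stored once in a pool, and every later occurrence reuses the pooled instance.

```python
class StringInterner:
    """Pool of unique strings: equal values share a single instance."""

    def __init__(self):
        self._pool = {}

    def intern(self, text):
        # setdefault returns the already pooled instance when an equal
        # string is known, otherwise it stores and returns this one.
        return self._pool.setdefault(text, text)

    def __len__(self):
        return len(self._pool)
```

If a million events carry the same firmware name, the records then hold a million references to one string instead of a million copies of it.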
MongoDB gives us a centralized store of all data, if needed - but it is not where the continuous processing is done; these MongoDB databases are mostly written, seldom read. And the whole system should continue to work even if the MongoDB cluster is down, for whatever reason.

In short, this is some kind of BigData SOA design.
This results in a cleaner BigData approach than a regular NoSQL DB-centric architecture.
mORMot makes it easy to focus on the business, since most of the plumbing is done by the ORM and SOA features of the framework.

Junior/RO · 2017-09-01 15:14:49

@ab

Please give more details about "Some meta-services are able to identify on which node a device is connected (using TSynBloomFilter)."

I have read your blog post about TSynBloomFilter. Can you talk about a real-world usage?

ab · 2017-09-01 16:39:54

Every device has a serial number (a hexadecimal string in practice).
Servers are gathered in groups - e.g. per brand or third-party account.
Each group has (hundreds of) thousands of devices connected.
In each group, a microservice is responsible for maintaining a list of all connected devices, with their serial number and status (connection, etc...).

But we needed a way to identify in which group a given device is located, by serial number.
Pushing all serials to a single service or database would have been possible, but it would have meant a lot of information, for billions of devices.
The idea is to have each group push not the serial numbers themselves, but the TSynBloomFilter binary reduction of its current set of serials.

Then if we want to know where a serial number is located, we first ask the bloom filters.
If a single group matches, we are 100% sure that the device is located on it (a bloom filter never misses a member, so a connected device always matches its own group). No need to ask the group!
If there are several potential matches (i.e. a bloom filter false positive), we just have to ask the groups where it may be.
With bloom filters, it takes only about 10 bits to store each serial number, instead of 16 bytes, with a false positive rate of about 1%.
Thanks to this, resource use is much lower, in terms of CPU, RAM and network bandwidth.
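
The lookup described above can be sketched in Python as follows. This is a generic bloom filter, not the TSynBloomFilter implementation: the class, the SHA-256 based double hashing, and the `candidate_groups` / `locate` / `ask_group` names are assumptions made for the example. The sizing follows the standard formula: for a false positive rate p, the optimal size is -ln(p) / (ln 2)^2 bits per item, about 9.6 bits for p = 1%, with (bits per item) * ln 2, about 7, hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Generic bloom filter sized in bits per expected item."""

    def __init__(self, expected_items, bits_per_item=10):
        self.size = max(8, expected_items * bits_per_item)
        # Optimal number of hash functions is (bits per item) * ln 2.
        self.hashes = max(1, round(bits_per_item * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, serial):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(serial.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, serial):
        for pos in self._positions(serial):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def may_contain(self, serial):
        # False means "certainly absent"; True means "probably present".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(serial)
        )


def candidate_groups(filters, serial):
    """Names of the groups whose filter matches the serial."""
    return [name for name, f in filters.items() if f.may_contain(serial)]


def locate(filters, serial, ask_group):
    """Find the group of a serial, assuming the device is connected somewhere."""
    candidates = candidate_groups(filters, serial)
    if len(candidates) == 1:
        return candidates[0]  # no false negatives: it must be this group
    for name in candidates:
        # Several matches: only these groups need to be asked.
        if ask_group(name, serial):
            return name
    return None
```

Here `filters` maps a group name to the filter that the group published, and `ask_group(name, serial)` stands for the remote call to the group's microservice. A filter for 1,000 serials takes 1,250 bytes, against 16,000 bytes for the raw 16-byte serials.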

Junior/RO · 2017-09-04 02:38:21

Great!

zhangguichao · 2023-07-28 04:36:47

If you don't use a MongoDB cluster and rely only on SQLite, is there a feasible method for statistical analysis of big data? :)

ab · 2023-07-28 06:47:00

I would make a read-only SQLite3 copy of the data for analysis, with a dedicated service.
That way it would not affect the main DB.
In fact, what we wrote above still applies: "the data is stored in several places, each MicroService having its own storage".

This analysis-only data is likely to contain some additional meta-information, like consolidated or summary values, to ease the search.
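
One possible way to build such a copy, sketched with Python's standard `sqlite3` module rather than with mORMot: the `events` table, its `device` column and the `events_per_device` summary table are hypothetical. The copy is taken with the SQLite online backup API, the summary is added to the copy only, and the analysis service then opens it read-only.

```python
import sqlite3
from pathlib import Path


def make_analysis_copy(main_path, analysis_path):
    """Copy the main DB, then add summary data to the copy only."""
    src = sqlite3.connect(main_path)
    dst = sqlite3.connect(analysis_path)
    try:
        src.backup(dst)  # online backup: the main DB stays usable meanwhile
        with dst:
            # Consolidated values live in the analysis copy, never in the main DB.
            dst.execute("DROP TABLE IF EXISTS events_per_device")
            dst.execute(
                "CREATE TABLE events_per_device AS "
                "SELECT device, COUNT(*) AS event_count "
                "FROM events GROUP BY device"
            )
    finally:
        src.close()
        dst.close()


def open_read_only(path):
    """Open a database so that any write attempt fails."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

The analysis service can refresh the copy on its own schedule, and heavy queries run against the copy without taking locks on the main database.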

zhangguichao · 2023-07-28 08:11:38

Thank you for your answer.

mORMot Open Source
