We have a MongoDB database with thousands of records, each of which has to be related to millions of users.
Following MongoDB's document-oriented design, we related every record to its users with an embedded array of userIds, something like:
{
  "record": {
    "recordId": ObjectId("123"),
    "users": [{"userId": "B045312"}, {"userId": "B8435021"}, ...]
  }
}
But when the array of users gets too big (>300,000 users), we hit MongoDB's maximum document size of 16 MB.
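For reference, the server itself reports this limit in the shell:

// The server reports its maximum BSON document size:
db.isMaster().maxBsonObjectSize   // 16777216 bytes = 16 MB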
We tested different solutions that had really bad performance:

1. A Users collection with an array of records (the inverse of the original design).
2. Chunked Users documents that had to be joined back: the operation of recreating the original user list took 2 minutes, which was too much.
3. A RecordUser many-to-many relation table with recordId and userId columns.
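For reference, option 3 stores one small document per relation, something like (field names as listed above):

{ "recordId": "1", "userId": "B045312" }
{ "recordId": "1", "userId": "B8435021" }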
The solution we found was a variation of option 3, with a “custom” partial index.
Sample structure of the data:
[
  {
    "recordId": "1",
    "arrayId": "123",
    "usersChunk": [{"userId": "B045312"}, {"userId": "B8435021"}, ...],
    "order": 1,
    "first": "B045312",
    "last": "C456783"
  },
  {
    "recordId": "1",
    "arrayId": "123",
    "usersChunk": [{"userId": "C589432"}, {"userId": "C8435021"}, ...],
    "order": 2,
    "first": "C589432",
    "last": "D444432"
  }
]
This data layout is similar to a tree with only one level and a fixed leaf size.
Sorting, chunking and inserting into the database was by far the fastest of all the designs we tried: the whole process took about 20 seconds for a 10-million-user array.
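A minimal sketch of that sort-and-chunk step in plain JavaScript (the CHUNK_SIZE value, the buildChunks helper and the recordUserChunks collection name are ours for illustration, not the exact production code):

const CHUNK_SIZE = 20000; // fixed leaf size

function buildChunks(recordId, arrayId, userIds) {
  // Lexicographic sort, so 'first' and 'last' bound each chunk.
  const sorted = [...userIds].sort();
  const docs = [];
  for (let i = 0; i < sorted.length; i += CHUNK_SIZE) {
    const slice = sorted.slice(i, i + CHUNK_SIZE);
    docs.push({
      recordId,
      arrayId,
      usersChunk: slice.map(userId => ({ userId })),
      order: i / CHUNK_SIZE + 1,       // 1-based position of the chunk
      first: slice[0],                 // smallest userId in this chunk
      last: slice[slice.length - 1]    // largest userId in this chunk
    });
  }
  return docs;
}

// db.recordUserChunks.insertMany(buildChunks("1", "123", userIds));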
To check if a userId=A012345 is related to a recordId=123:
findOne({
  recordId: 123,
  first: { $lte: 'A012345' },   // $lte/$gte instead of $lt/$gt, so userIds at chunk boundaries still match
  last: { $gte: 'A012345' },
  'usersChunk.userId': 'A012345'
},
{ _id: 1 }   // projection: we only need to know whether a matching chunk exists
)
This query uses a compound index on recordId+first+last and scans only a single chunk's array in one pass. With chunks of 20,000 items, a lookup in a chunked 10-million-user array returned in about 0.015 seconds.
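Creating that compound index in the shell (the collection name recordUserChunks is our placeholder):

db.recordUserChunks.createIndex({ recordId: 1, first: 1, last: 1 })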
The original full array took more time being transferred from the database than being queried: it is about 100 MB of data and takes around 2 seconds to reach the application.
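When the full list is actually needed, it can be rebuilt by reading the chunks in order and concatenating them; a sketch with the Node.js driver, again using our placeholder collection name:

async function getAllUserIds(db, recordId) {
  // Read the chunks in their original sequence and concatenate them.
  const chunks = await db.collection('recordUserChunks')
    .find({ recordId })
    .sort({ order: 1 })                  // 'order' restores the original chunk sequence
    .project({ usersChunk: 1, _id: 0 })
    .toArray();
  return chunks.flatMap(chunk => chunk.usersChunk.map(u => u.userId));
}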
We tried to get a solution from ChatGPT and friends, but the answers we got were always combinations of existing approaches.
I tried hinting ChatGPT toward the solution we found, but it never cooperated: even when I was explicit about the solution, it did not understand it and kept proposing a combination of the known ones.
AI in its current LLM form will always depend on its training data. When given a new problem, it will only throw existing solutions at it, merging or tweaking them a little.
Many times in software engineering we run into this kind of issue, where we need to hack a system to make it work the way we need with the limited resources we have, and AI systems seem unable to do this because they only reuse solutions they have already seen.
AI seems to reach for “template” solutions, while software engineers have the capacity to be more malleable and to invent things with more variation.
No idea if it is going to stay like this or if the AI folks will be able to overcome this kind of limitation, but as of today (2024-03-14) I think SWEs are safe.