As a developer who has worked on synchronization servers for several years, my team and I have been responsible for most of the synchronization services our company provides.
Here are some key points I’ve learned from building synchronization servers that handle requests from hundreds of millions of monthly active users (MAU).

Key Considerations in the Synchronization Domain

  1. Whose Synchronization Is It?
  2. Syncing Deletions
  3. Safe and Quick Actual Deletion
  4. Considering Faster Deletions
  5. Handling Synchronization
  6. Managing Initial Synchronization
  7. Ensuring No Misses in Downsync
  8. Considering Partial Errors
  9. Considering Legal Requirements
  10. Managing Policies

1. Whose Synchronization Is It?

The most important aspect of synchronization is whose data is being synchronized.
Understanding the target of synchronization leads to a clear domain design.

Is the synchronization for a single user?
Is it for multiple devices?
Is it for shared use with others?
Or is it a publicly open synchronization?

The type of synchronization determines how the data is modeled and marked on the server.
Policies such as the relationship between the data creator and ownership, and the permissions to view and modify the data, will differ accordingly.
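
As a rough sketch of how the synchronization target can shape the domain model, here is a minimal, hypothetical ownership and permission structure (the scopes and field names are illustrative, not from any specific service):

```python
from dataclasses import dataclass, field
from enum import Enum


class SyncScope(Enum):
    """Who the synchronization is for; this drives ownership and permissions."""
    SINGLE_USER = "single_user"   # one user across multiple devices
    SHARED = "shared"             # explicitly shared with other users
    PUBLIC = "public"             # openly readable by anyone


@dataclass
class SyncedItem:
    item_id: str
    creator_id: str               # who created the data
    owner_id: str                 # who owns it (may differ from the creator)
    scope: SyncScope
    viewers: set[str] = field(default_factory=set)
    editors: set[str] = field(default_factory=set)

    def can_view(self, user_id: str) -> bool:
        if self.scope is SyncScope.PUBLIC:
            return True
        return user_id == self.owner_id or user_id in self.viewers or user_id in self.editors

    def can_modify(self, user_id: str) -> bool:
        # Modification stays restricted even when viewing is public.
        return user_id == self.owner_id or user_id in self.editors
```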

2. Syncing Deletions

When a user deletes data, it typically gets removed from the server.
However, in synchronization, deletions must also be synchronized.

If data is deleted on one device, it should also be deleted on other devices.
If one user deletes data, other users should no longer be able to access that data.

Even if multiple devices are involved or if a user synchronizes months later, the deletion should still be reflected in the service.
Syncing a deletion usually requires only a small amount of metadata to reconcile it, so it may be fine to delete the actual file or data at this point.

Syncing deletions properly means distinguishing between synchronized deletion and actual data deletion, and never confusing the two.
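
One common way to implement this separation, sketched below with a hypothetical storage interface, is a tombstone: the heavy payload is deleted right away, while a small deletion marker remains so that other devices can sync the deletion even months later:

```python
import time


def delete_item(db, item_id: str) -> None:
    """Replace the item's payload with a tombstone instead of erasing the row.

    The tombstone carries just enough metadata for other devices to learn
    about the deletion, no matter how late they sync.
    """
    db.blobs.delete(item_id)                 # the actual file/data can go now
    db.items.update(
        item_id,
        {
            "deleted": True,                 # synchronized-deletion marker
            "deleted_at": int(time.time()),
            "payload": None,                 # actual data deletion
        },
    )
```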

3. Safe and Quick Actual Deletion

Unlike deletions that are synced, there is also actual deletion. This often happens when a user leaves the service.
Even in deletion synchronization, everything except the metadata may be physically deleted.

When actual data deletion occurs, mark it for deletion and handle it in batches via events.
Actual data deletion takes longer, but batching allows for quicker responses to users and safer failure handling and retries.
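
A minimal sketch of this mark-then-batch pattern, assuming a hypothetical db and event queue interface:

```python
import logging

log = logging.getLogger(__name__)


def request_account_deletion(db, queue, user_id: str) -> None:
    """Fast path: mark the user and publish an event, then return immediately."""
    db.users.update(user_id, {"deletion_pending": True})
    queue.publish("user.deletion.requested", {"user_id": user_id})


def deletion_worker(db, event: dict, batch_size: int = 1000) -> None:
    """Slow path: purge the user's data in batches so a failure can be
    retried from where it stopped."""
    user_id = event["user_id"]
    while True:
        items = db.items.find({"owner_id": user_id}, limit=batch_size)
        if not items:
            break
        for item in items:
            db.blobs.delete(item["item_id"])
        db.items.delete_many([item["item_id"] for item in items])
        # Log enough to investigate "My photos were deleted!" reports later.
        log.info("deleted %d items for user %s", len(items), user_id)
```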

All services should handle deletions safely, but extra care is needed when deleting user data.
It’s crucial to log sufficiently, though not excessively, to handle issues like “My photos were deleted!”

4. Considering Faster Deletions

There are times when all of a specific user’s data needs to be deleted: when the user leaves, under GDPR’s right to erasure, or because of expiration or service policies.
Deletion may not complete as quickly as expected, and if the user syncs while it is in progress, there’s a risk they receive partially deleted data.

Using a concept like a revision, you can make it appear as if the user’s data has been deleted simply by changing their revision.
This also works well with batch processing.

For example, raising a user’s revision from 2 to 3 makes it appear as if all the data under revision 2 has been wiped.
Even if the revision-2 data is actually purged later in a batch, there won’t be any issues.
And to the user, the data appears to have been wiped cleanly.
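
Here is a sketch of the revision idea with hypothetical interfaces: every read filters by the user’s current revision, so bumping it hides the old data immediately while a batch job purges it later:

```python
def wipe_user_data(db, user_id: str) -> None:
    """Make the wipe appear instant by bumping the revision; the old
    revision's rows are purged later by a batch job."""
    old_rev = db.users.get(user_id)["revision"]          # e.g. 2
    db.users.update(user_id, {"revision": old_rev + 1})  # now 3
    # From this point, every read ignores revision-2 data.


def list_items(db, user_id: str) -> list[dict]:
    rev = db.users.get(user_id)["revision"]
    # Only data written under the current revision is visible.
    return db.items.find({"owner_id": user_id, "revision": rev})
```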

5. Handling Synchronization

On the server side, synchronization is handled with a concept similar to a pageToken.
We drive synchronization proactively on the server using a value we call the SyncPoint.

It’s best to use an encoded value for SyncPoint, which the client cannot interpret.
This prevents unintended or unauthorized use by the client.

For instance, if you specified SyncPoint=100 for the 100th item, the client could interpret it and send SyncPoint=1000 to try to retrieve the 1000th item.

A SyncPoint can be used to track how far a specific user’s client has synchronized, and it can also be used to detect schema changes.
By encoding the data needed for synchronization into the SyncPoint, the server can monitor and manage synchronization progress.
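
One way to make the SyncPoint opaque, sketched here with HMAC signing (the payload fields and secret handling are assumptions, not a prescribed format): the server can decode and verify it, while a client that tampers with it is rejected:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # assumption: known only to the server


def encode_sync_point(cursor: int, schema_version: int) -> str:
    payload = json.dumps({"cursor": cursor, "schema": schema_version}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()


def decode_sync_point(token: str) -> dict:
    raw = base64.urlsafe_b64decode(token.encode())
    sig, payload = raw[:32], raw[32:]   # SHA-256 digest is 32 bytes
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered or invalid SyncPoint")
    return json.loads(payload)
```

Because the schema version travels inside the token, the server can also tell when a client needs a fresh initial sync after a schema change.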

Since the synchronization trigger ultimately comes from the client, it’s essential to have indicators to identify client-side implementation issues.

6. Managing Initial Synchronization

Managing synchronization with SyncPoint also means managing initial synchronization.
Initial synchronization involves receiving all data from the beginning, which means heavy resource usage for both the server and client.

You need to be able to track how far a client has progressed through initial synchronization. And if initial synchronization is required due to a schema change, the server should be able to prompt it.
To this end, the server issues an initial SyncPoint to the client under certain conditions.
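
Building on the token sketch above, a server might hand out a fresh initial SyncPoint when it sees no token or a stale schema (fetch_page and the field names are assumptions):

```python
CURRENT_SCHEMA = 5


def handle_sync_request(db, user_id: str, token: str | None) -> dict:
    if token is None:
        # First sync ever: start from the beginning.
        point = {"cursor": 0, "schema": CURRENT_SCHEMA}
    else:
        point = decode_sync_point(token)
        if point["schema"] != CURRENT_SCHEMA:
            # Schema changed: prompt the client to re-run initial sync.
            point = {"cursor": 0, "schema": CURRENT_SCHEMA}
    items, next_cursor = fetch_page(db, user_id, point["cursor"])
    return {
        "items": items,
        # The returned cursor also tells the server how far
        # this client has progressed through initial sync.
        "sync_point": encode_sync_point(next_cursor, CURRENT_SCHEMA),
    }
```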

7. Ensuring No Misses in Downsync

You need to confirm the baseline for downsync.
This baseline could be a specific value of SyncPoint and is often related to the modifiedTime of the synchronized item.

We once had an issue where items sharing the same modifiedTime spanned a pagination boundary, causing an item to be missed.
When this happens, the data may never be retrieved unless an initial sync is performed, and the bigger problem is that the development team may not even be aware of the issue.

Although multiple items having the same modifiedTime and being split across pages is rare, downsync must be developed in a way that prevents any misses.
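
A common fix, sketched below, is keyset pagination on a composite key (modifiedTime, itemId) rather than modifiedTime alone, so items sharing a timestamp can never straddle a page boundary undetected (the row-value comparison assumes a database such as PostgreSQL; the db wrapper is hypothetical):

```python
def fetch_page_after(db, user_id: str, last_time: int, last_id: str,
                     limit: int = 100) -> list[dict]:
    """Keyset pagination on (modified_time, item_id).

    Ordering by modified_time alone can drop items: if several share the
    same timestamp and the page boundary falls between them, the next page
    would skip the rest. The item_id tie-breaker makes the order total.
    """
    return db.query(
        """
        SELECT * FROM items
        WHERE owner_id = %s
          AND (modified_time, item_id) > (%s, %s)
        ORDER BY modified_time, item_id
        LIMIT %s
        """,
        (user_id, last_time, last_id, limit),
    )
```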

8. Considering Partial Errors

During synchronization, it’s common for a user to synchronize multiple items at once.
When multiple items are synchronized, should the entire request fail if some items fail?
Or should the response include both the failed and successful items?

This issue, which is related to how retries are handled, depends on what data is being synchronized.
If the synchronized data is large and heavy, it’s better to design for partial errors.

However, if the data is small or if the range of retries can be reduced (e.g., using a hash before uploading a file), it’s much simpler for both client and server to handle errors without partial error handling.
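
If you do opt into partial errors, the response can report per-item outcomes so the client retries only what failed; a hypothetical shape:

```python
def upsync_items(db, user_id: str, items: list[dict]) -> dict:
    """Apply each item independently and report per-item outcomes, so the
    client retries only the failed items instead of the whole request."""
    succeeded, failed = [], []
    for item in items:
        try:
            db.items.upsert(user_id, item)
            succeeded.append(item["item_id"])
        except Exception as exc:  # in practice, catch specific errors
            failed.append({"item_id": item["item_id"], "reason": str(exc)})
    return {"succeeded": succeeded, "failed": failed}
```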

While errors may not be frequent, the complexity introduced by partial errors is greater than you might expect.

9. Considering Legal Requirements

There are legal requirements for managing user data, the most common being the EU’s GDPR.
If GDPR handling is not built into the same domain as the service logic, every service change complicates compliance management.

GDPR should be considered whenever the service is global, and it should be possible to handle it within the same domain as the service logic.

10. Managing Policies

Policies are important in every domain.
In enterprise-level services, if policies aren’t defined, you’ll encounter users who abuse the system beyond common sense.

Heavy users are valuable customers.
But users who go far beyond common sense are not just heavy users: they may generate data that normally shouldn’t be possible, often by exploiting policy loopholes or bugs.

For example, I’ve seen users with 2 million items of a kind that takes at least a few seconds to process per item created.
Without policies, enterprise-level services often encounter such data from abusive users.
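
Such limits can be enforced at the API boundary; a minimal sketch with made-up numbers (real values come from product policy, not code):

```python
class PolicyError(Exception):
    pass


# Illustrative limits; actual values are a product decision.
MAX_ITEMS_PER_USER = 100_000
MAX_ITEMS_PER_REQUEST = 500


def check_create_policy(db, user_id: str, new_items: list[dict]) -> None:
    """Reject requests that would exceed per-request or per-user limits."""
    if len(new_items) > MAX_ITEMS_PER_REQUEST:
        raise PolicyError("too many items in one request")
    current = db.items.count({"owner_id": user_id})
    if current + len(new_items) > MAX_ITEMS_PER_USER:
        raise PolicyError("per-user item limit reached")
```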

Unimaginable usage patterns always exist, no matter how unlikely they seem.
Services like Instagram or Google Photos have clear policies in place.
If policies aren’t established, it will eventually lead to quality and service issues.