To build a reliable backup service, it’s crucial to understand the key considerations within the backup domain.
While it might seem similar to synchronization in terms of uploading and downloading data from a server, backup services have distinct characteristics.

Key Considerations in the Backup Domain

  1. Distinguishing the Start and End of Backup and Restore
  2. Considering Revisions
  3. Managing Backup Lifecycle
  4. Preventing Data Loss
  5. Focusing on Restore
  6. Locks Might Not Be Necessary
  7. Types of Backups
  8. Considering Timing for Automatic Backups
  9. Understanding Restore Usability
  10. Managing Policies

Terminology

Term Meaning
Backup The process of uploading a user’s data from their device to a server.
Automatic Backup A feature that performs backups automatically based on set conditions.
Restore The process of retrieving a user’s saved data from the server back to their device.
Revision A versioning method to differentiate each backup iteration when backups occur multiple times.
Snapshot The data stored during each backup request from a device.
Full Backup A backup process that uploads all data during each backup session.
Incremental Backup A backup process that uploads only new or modified data since the last backup.

1. Distinguishing the Start and End of Backup and Restore

Backups are complete when all data is successfully backed up, and the same applies to restores.
You can create an interface to fulfill these functions, but large backups might involve thousands of API calls, potentially taking hours depending on the network.

Without clear checkpoints, you won’t know if their backup has finished or if it stopped midway.
This is why it’s important to have start and end checkpoints for backups and restores. Without these, it’s difficult to check whether a process has completed or encountered an issue.

The checkpoint is also essential for analyzing performance, issues, and usability. These start/end APIs aren’t just for checkpoints. They can also be used for processing snapshots

2. Considering Revisions

When users request a new backup, it needs to be distinguished from previous backups.
For example, if a backup is in progress and the user requests a restore, the system should use the last completed backup, not the current one.
If the backup process is interrupted, the incomplete backup must not overwrite the previous, completed one.

Therefore, it’s essential to include revision keys in data design, which allows for visibility management and enables async garbage collection (deleting old backups).

3. Managing Backup Lifecycle

Backups involve large amounts of data, so it’s critical to ensure that data deletion happens correctly.
If not, this will directly impact costs, as undetected undeleted data accumulates.

Backups can be stored from days to years, depending on lifecycle policies.
To manage this, async deletions should be executed via batch jobs that are stable and accurate.

Given that deletions occur asynchronously, it’s easy to miss them.
Missing deletion leads to that much cost and additional work.

Due to the nature of backups, snapshots of specific backups are often erased when data is erased.
Therefore, it is better to continue the design until deletion by Considering Revisions.

4. Preventing Data Loss

The most fundamental principle of backup system design is to never lose user data.
Surprisingly, there are many scenarios where data loss can occur, even in production services.

While developing new features, engineers sometimes prioritize cost, efficiency, or operational concerns, leading to decisions that increase the risk of data loss.
However, managing user data means ensuring that data is never lost, no matter the circumstances. It’s crucial to design the system not only to avoid losing data, but also to prevent any situation that could lead to human errors resulting in data loss.

For instance, allowing batch jobs to directly access the data storage can be dangerous.
If the schema or access logic changes, the batch jobs may not reflect those updates.
It will potentially result catastrophic data loss. (as has happened to our team in past incidents).

The goal is to design a system where there are no issues even without constant monitoring, rather than one that requires attention to avoid problems.

5. Focusing on Restore

Just as preventing data loss is important, ensuring safe data restoration is equally critical.

The key difference between backups and synchronization is that backups store data, while synchronization keeps it updated in real time.
Backups only upload data. Even with automatic backups, the only time data is downloaded is during a restore request.
If data is restored incorrectly, there’s no way to correct it until the user makes another restore request, which could be problematic if the client doesn’t handle exclusions for previously restored items.
In some cases, it’s better not to restore data at all than to restore it incorrectly.

Synchronization, on the other hand, provides more opportunities to correct datas because data is updated frequently.

6. Locks Might Not Be Necessary

Typically, backups are device-specific, this means that there is no need to consider simultaneous requests from multiple devices.
No simultaneous requests mean no simultaneous writes, so locks might not be needed, depending on the design.
However, there might be cases where a restore (read) request is made simultaneously with a backup (write) request.

In contrast, synchronization often involves simultaneous writes from multiple devices or users, necessitating the use of locks.

7. Types of Backups

Each team might use different terms, but generally, there are two main types of backups1

  1. Full Backup – A complete backup regardless of previous backups.
  2. Incremental Backup – A backup that only includes data added or changed since the last backup.

You should choose the appropriate type based on the backup data’s requirements.
For our services, which handle various content with different characteristics, we use both types as needed.

8. Considering Timing for Automatic Backups

Nowadays, many services offer automatic backup options.
Automatic backups usually run at times when they won’t impact the user’s experience, typically during the night.

If a client is set to run automatic backups at midnight, all backup traffic will be concentrated at that time.
Although the client handles the timing, the server should also monitor this.

Backup services generally see a significant difference between peak and non-peak API request volumes due to these time-based patterns.

9. Understanding Restore Usability

Restores have their own characteristics.
Unlike automatic backups, restore requests typically occur during the day when users are active.

Users often perform backups multiple times, but restores may only happen once—or not at all.
Therefore, restore requests are much less frequent compared to backups.
However, restores are often more important to users than backups, so the restore function needs to be well-managed.

  • Backup and restore metrics should be managed separately so that restores aren’t overshadowed by backup metrics.

10. Managing Policies

Policies are important across all domains.
For enterprise-level services, not having policies in place can lead to encountering abusive users with behavior far beyond the norm.

For instance, while most users might have around 300 backup items, we’ve seen abusive user trying to store up to 1.5 million items.
If policies aren’t established beforehand, it can be difficult to address such cases after the fact, and handling them may require additional work.