RPi NAS: Extras – Limiting Data Corruption

25 January 2024

Data corruption can occur at any point when using the NAS. In this post we’ll look at a couple of simple measures that reduce the risk of losing data to corruption.

This post is part of a series about building a Network-Attached Storage (NAS) with redundancy using a Raspberry Pi (RPi). See here for a list of all posts in this series.


In this post we’re going to focus on silent data corruption1. Here, silent means that there are errors in a file (e.g. some bits may have flipped) but neither the operating system nor the hard drive controller knows about them. There are many reasons why such errors can occur, for example head crashes, vibrations or voltage spikes2.

Data corruption can affect any part of our computer: the CPU, RAM, (hard or solid-state) drives, wireless transmissions, etc. There are protections against it at many of these levels: for example, ECC RAM uses error-correcting codes, RAIDs can use parity bits, and so on.

Another common protection mechanism against data corruption is the checksum3. Let me briefly (and roughly) describe how checksums work. Before the whole process starts we need to select a checksum function: a function that takes a file4 as input and returns some number (the checksum) as output. That number is saved. At some later point in time (or on a different computer) the checksum can be recalculated, using the same function as before, and the result can be compared to the previously saved value. If the two don’t match then there is an error in the data (or in the stored checksum). If they do match then there’s a good chance that the data is correct. How good that chance is depends on the checksum function that was used: very simple checksum functions can only detect an odd number of bit flips, while the functions normally used in practice are much better at detecting errors. Note that, although similar, a checksum function is not the same as a hash function5.
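
To make this concrete, here’s what the process looks like on the command line. The example uses sha256sum (strictly a cryptographic hash, used here simply as a strong checksum); the file name is just a placeholder.

# compute the checksum of a file and save it
sha256sum important-file.txt > important-file.txt.sha256

# later (or on another machine): recompute the checksum and compare it to the saved value
sha256sum -c important-file.txt.sha256

If the file is intact, the second command prints important-file.txt: OK; if the contents (or the saved checksum) have changed, it reports a mismatch.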

In this post we’re going to look at two types of protections that can reduce (but not eliminate) the risk of data corruption on our NAS.

Corruption on Storage Drives

Let’s assume that there’s a file on one of our storage drives that suffers from data corruption. For example, a bit of the data stored on the drive has flipped. Some file systems, like ZFS, have been designed to detect and correct such corruption. Unfortunately, the filesystem we’ve been using in this NAS, Ext4, is not able to detect it.

Note that current implementations of Ext4 can detect errors in the metadata of files but not in the actual data (see here for more information). You can find out whether metadata checksums are enabled for your drive with the following command (replace /dev/sda1 with the path to your drive).

sudo tune2fs -l /dev/sda1

The output contains a line that starts with Filesystem features. If the line includes the word metadata_csum then checksums are enabled for metadata6. However, as mentioned above, the actual contents of files are not checksummed.
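
If you don’t want to search through the whole output, you can filter it, for example:

# print only the line listing the filesystem features
sudo tune2fs -l /dev/sda1 | grep 'Filesystem features'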

Luckily there’s an option in Greyhole that allows us to check the files’ contents for corruption. When Greyhole first copies a file, it computes a checksum and saves it in a database. We can then run the following command:

sudo greyhole --fsck --checksums

Greyhole will now read all files on the NAS, compute their checksums and compare them to the values previously saved in the database (this may take a while).

The file that suffers from data corruption (as assumed at the beginning of this section) will probably have a different checksum than the one previously saved in the database. Greyhole will now

  • log the name of the file in /usr/share/greyhole/fsck_checksums.log and
  • check if other copies of the file (saved on other storage drives) are intact.

If at least one of the other copies of the file has a correct checksum then the corrupted file will be replaced by the one that’s intact.
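
If you want to see which files (if any) were flagged during the check, have a look at the log file mentioned above:

# show the most recent entries in Greyhole's checksum log
# (if the file doesn't exist, nothing has been logged there yet)
sudo tail /usr/share/greyhole/fsck_checksums.log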

Running greyhole --fsck --checksums regularly should thus reduce the risk of losing files to data corruption. Running it manually is tedious, though, so let’s add a job to cron. Open the root crontab with sudo crontab -e. To run this check at 3am on the second day of every month, add the following line (or adjust it to your preferences):

0 3 2 * * greyhole --fsck --checksums
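
You can confirm that the job was saved by listing the root user’s cron jobs:

# list all cron jobs of the root user
sudo crontab -l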

Corruption when Copying Data

Errors can also occur when we copy data to the NAS. Simple copy commands like cp don’t check whether the data was copied correctly. Some programs do check. For example, the man page of rsync (you can open it with man rsync) says:

Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file
is transferred [...]

So if we set up and use rsync as described here, it will check whether the data received on the NAS has the same checksum as the data that was sent. Once the data has been received on the NAS it is written to our drives. Errors can occur during that writing process, but rsync won’t check for those (i.e. it doesn’t re-read the data from the disk to verify the written data’s checksum); instead it relies on the OS to report whether the data was written successfully.
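
If you want to double-check a copy after the fact, rsync’s --checksum option can help: it compares files by checksum instead of by size and modification time, which forces rsync to re-read the file contents on both sides. Combined with a dry run, it lists the files whose copies on the NAS differ from the source without transferring anything. Here’s a minimal sketch; the paths and the host name nas are placeholders for your setup.

# dry run: compare the local data and the copies on the NAS by checksum
# and list any files that differ (nothing is transferred)
rsync --archive --checksum --dry-run --itemize-changes /path/to/local/data/ nas:/path/on/nas/

Note that this reads every file on both sides, so it can take a long time for large shares.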

While none of the above measures can completely prevent data corruption (and they are certainly not as robust as ZFS), they still help us reduce the risk of losing data to it.


Footnotes:

  1. The other kind of data corruption, which is detectable, is less risky because when we (or the OS) know about an error, it can typically be corrected. E.g. if some packets of a WiFi transmission weren’t received correctly (and we know about it), they can simply be resent. ↩︎
  2. See the Wikipedia page about data corruption for more examples. ↩︎
  3. Computing a parity bit can actually be seen as a checksum function. ↩︎
  4. Checksums aren’t actually limited to files. You can compute a checksum of pretty much anything (e.g. of a block of data on the hard drive). I’m only mentioning files in the text above for simplicity. ↩︎
  5. There are a few differences, but two important ones are intent and uniqueness. A hash function should produce a unique output (number) for each input. The different, unique hash values can then be used for tasks like indexing tables, which is a fast operation. (In practice there will be some non-identical inputs whose outputs are identical; these are called collisions.) For a checksum function, uniqueness is not important. The task of a checksum is to check the integrity of a file (or block of data), and it’s perfectly fine for two different inputs to have the same output. For example, consider a simple checksum function that computes a parity bit. There are only two possible outputs, 0 and 1, so for any three blocks of data, at least two of them will have the same checksum. ↩︎
  6. I believe metadata_csum is enabled by default in current implementations of Ext4, but I’m not entirely sure about that. ↩︎
