btrfs - A modern copy-on-write (CoW) filesystem for Linux

Introduction

btrfs is a modern copy-on-write (CoW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair, and easy administration. Its main features and benefits are:

  • Snapshots which do not make a full copy of the files
  • RAID - support for software-based RAID 0, RAID 1, RAID 10
  • Self-healing - checksums for data and metadata, automatic detection of silent data corruption

Development of Btrfs started in 2007. It was merged into the mainline Linux kernel in 2009 and remains under active development.

Copy on Write (CoW)

  • The CoW operation is used on all writes to the filesystem (unless turned off, see below).

  • This makes it much easier to implement lazy copies, where the copy is initially just a reference to the original, but as the copy (or the original) is changed, the two versions diverge from each other in the expected way.

  • If you just write a file that didn’t exist before, then the data is written to empty space, and some of the metadata blocks that make up the filesystem are CoWed. In a “normal” filesystem, if you then go back and overwrite a piece of that file, then the piece you’re writing is put directly over the data it is replacing. In a CoW filesystem, the new data is written to a piece of free space on the disk, and only then is the file’s metadata changed to refer to the new data. At that point, the old data that was replaced can be freed up because nothing points to it any more.

  • If you make a snapshot (or a cp --reflink=always copy) of a piece of data, you end up with two files that both reference the same data. If you modify one of those files, the CoW operation I described above still happens: the new data is written elsewhere, and the file’s metadata is updated to point at it, but the original data is kept, because it’s still referenced by the other file (see the reflink example after this list).

  • This leads to fragmentation in files that are heavily updated in place, such as VM images and database stores.

  • Note that this happens even if the data is not shared, because data is stored in segments, and only the newly updated part of a segment is subject to CoW.

  • If you mount the filesystem with nodatacow, or use chattr +C on the file, then the CoW operation is only performed for data when more than one copy is referenced (also demonstrated after this list).

  • Some people insist that Btrfs does “Redirect-on-write” rather than “Copy-on-write” because Btrfs is based on a scheme for redirect-based updates of B-trees by Ohad Rodeh, and because understanding the code is easier with that mindset.
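
The reflink and CoW-disabling behaviour described above can be exercised directly from the shell. A brief sketch follows; the device name and file paths are placeholders:

# Lazy copy: the new file initially shares all of its data extents with the original
$ cp --reflink=always /mnt/bigfile /mnt/bigfile.copy

# Disable CoW for a single file; the +C attribute only takes effect on empty files,
# so it must be set before any data is written
$ touch /mnt/vm.img
$ chattr +C /mnt/vm.img

# Disable CoW for all newly created files on the filesystem
$ mount -o nodatacow /dev/sdb /mnt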

Filesystem creation

A Btrfs filesystem can be created on top of many devices, and more devices can be added after the FS has been created.

By default, metadata will be mirrored across two devices and data will be striped across all of the devices present. This is equivalent to mkfs.btrfs -m raid1 -d raid0.

If only one device is present, metadata will be duplicated on that one device, unless the device is an SSD: for an HDD the default is equivalent to mkfs.btrfs -m dup -d single, while for an SSD (or other non-rotational device) it is equivalent to mkfs.btrfs -m single -d single.

mkfs.btrfs will accept more than one device on the command line. It has options to control the RAID configuration for data (-d) and metadata (-m). Valid choices are raid0, raid1, raid10, raid5, raid6, single and dup. The option -m single means that metadata is not duplicated at all, which may be desired when using hardware RAID. Note that the btrfs RAID 5/6 implementation has known issues and is generally not recommended for production use.

# Create a filesystem across four drives (metadata mirrored, linear data allocation)
$ mkfs.btrfs -d single /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Stripe the data across the two drives without mirroring; metadata is mirrored by default
$ mkfs.btrfs -d raid0 /dev/sdb /dev/sdc

# Use raid10 for both data and metadata
$ mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Don't duplicate metadata on a single drive (default on single SSDs)
$ mkfs.btrfs -m single /dev/sdb

Once you have created a multi-device filesystem, you can pass any of its member devices to the mount command. The size of the btrfs filesystem is the total size of the devices used to create it.

$ mkfs.btrfs /dev/sdb /dev/sdc /dev/sde
$ mount /dev/sde /mnt
$ df -h

The following commands can be used to check filesystem usage. The btrfs-specific commands give a more complete picture than plain df, because btrfs allocates space for data and metadata separately.

$ btrfs filesystem show
$ btrfs filesystem df -h /mnt
$ btrfs filesystem usage /mnt

Filesystem deletion

$ umount -f /mnt
$ wipefs --all -t btrfs /dev/sdb /dev/sdc /dev/sde
$ btrfs filesystem show

Subvolumes and snapshots

Creating subvolumes and snapshots is one of the most commonly used btrfs operations.

Create a btrfs filesystem with two disks:

$ mkfs.btrfs -d raid0 /dev/sdd /dev/sdf
btrfs-progs v4.9.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               6a154e4e-61d4-474e-9839-50d1fcd50bbb
Node size:          16384
Sector size:        4096
Filesystem size:    1.75TiB
Block group profiles:
  Data:             RAID0             2.00GiB
  Metadata:         RAID1             1.00GiB
  System:           RAID1             8.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Number of devices:  2
Devices:
   ID        SIZE  PATH
    1   894.25GiB  /dev/sdd
    2   894.25GiB  /dev/sdf

$ mkdir /mnt/pool1
$ mount -t btrfs /dev/sdd /mnt/pool1
$ df -h | egrep "Filesystem|pool1"
Filesystem                 Size  Used Avail Use% Mounted on
/dev/sdd                   1.8T  4.3M  1.8T   1% /mnt/pool1
$ ls -la /mnt/pool1
total 16
drwxr-xr-x  1 root root  0 Mar 22 21:19 .
drwxr-xr-x. 7 root root 69 Mar 22 21:20 ..

Create a subvolume:

$ btrfs subvolume create /mnt/pool1/subvol1
Create subvolume '/mnt/pool1/subvol1'

$ mkdir /mnt/testbtrfs
$ mount -t btrfs -o subvol=subvol1 /dev/sdd /mnt/testbtrfs
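
To double-check which subvolume ended up mounted at the new mount point, the following commands should work (output omitted here):

$ findmnt /mnt/testbtrfs
$ btrfs subvolume show /mnt/testbtrfs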

$ ls -la /mnt/pool1
total 16
drwxr-xr-x  1 root root 14 Mar 22 21:19 .
drwxr-xr-x. 8 root root 86 Mar 22 21:22 ..
drwxr-xr-x  1 root root  0 Mar 22 21:21 subvol1

$ ls -la /mnt/pool1/subvol1/
total 16
drwxr-xr-x 1 root root  0 Mar 22 21:21 .
drwxr-xr-x 1 root root 14 Mar 22 21:19 ..

$ ls -la /mnt/testbtrfs
total 0
drwxr-xr-x  1 root root  0 Mar 22 21:21 .
drwxr-xr-x. 8 root root 86 Mar 22 21:22 ..

Create snapshot of the subvolume:

$ btrfs subvolume snapshot
btrfs subvolume snapshot: too few arguments
usage: btrfs subvolume snapshot [-r] [-i <qgroupid>] <source> <dest>|[<dest>/]<name>

    Create a snapshot of the subvolume

    Create a writable/readonly snapshot of the subvolume <source> with
    the name <name> in the <dest> directory.  If only <dest> is given,
    the subvolume will be named the basename of <source>.

    -r             create a readonly snapshot
    -i <qgroupid>  add the newly created snapshot to a qgroup. This
                   option can be given multiple times.

$ btrfs subvolume snapshot /mnt/pool1/subvol1 /mnt/pool1/subvol1/subvol1-snap
Create a snapshot of '/mnt/pool1/subvol1' in '/mnt/pool1/subvol1/subvol1-snap'

$ ls -la /mnt/pool1/subvol1/
total 16
drwxr-xr-x 1 root root 24 Mar 22 21:26 .
drwxr-xr-x 1 root root 14 Mar 22 21:19 ..
drwxr-xr-x 1 root root  0 Mar 22 21:21 subvol1-snap

$ ls -la /mnt/pool1/subvol1/subvol1-snap/
total 0
drwxr-xr-x 1 root root  0 Mar 22 21:21 .
drwxr-xr-x 1 root root 24 Mar 22 21:26 ..

$ ls -la /mnt/testbtrfs/
total 0
drwxr-xr-x  1 root root 24 Mar 22 21:26 .
drwxr-xr-x. 8 root root 86 Mar 22 21:22 ..
drwxr-xr-x  1 root root  0 Mar 22 21:21 subvol1-snap

$ ls -la /mnt/testbtrfs/subvol1-snap/
total 0
drwxr-xr-x 1 root root  0 Mar 22 21:21 .
drwxr-xr-x 1 root root 24 Mar 22 21:26 ..
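
Because the snapshot is itself a CoW copy, changes made to the original subvolume after the snapshot was taken do not show up in it. A quick way to see this (the file name is just an example):

$ touch /mnt/pool1/subvol1/newfile
$ ls /mnt/pool1/subvol1/subvol1-snap/
# newfile is not listed; the snapshot still reflects the state at creation time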

List the subvolumes and snapshots:

$ btrfs subvolume list /mnt/pool1
ID 258 gen 10 top level 5 path subvol1
ID 259 gen 9 top level 258 path subvol1/subvol1-snap
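
The subvolume IDs shown above can also be used to mount a subvolume directly, as an alternative to the subvol= name option used earlier. For example, to mount subvol1 by its ID (258 in the listing above):

$ mount -t btrfs -o subvolid=258 /dev/sdd /mnt/testbtrfs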
