Using Sparse Files as Disks for Networked RAID

In Unix and its variants, devices (disks, peripherals) are treated as files. Modern Linux distributions mount /dev at boot using the devtmpfs file-system and populate the device files dynamically (based on what is present on the system) using udev. Listing the /dev directory shows that the device files appear just like any other file.

Devices are files. That much we understand, but can files be devices, and if so, how?

In this article we look at creating sparse files, assigning them to loop devices and placing them in a software RAID configuration. The final step in the network RAID configuration is moving one (or more) of the files to a remote mount or share.

Creating a Sparse File

A sparse file does not consume disk blocks for its empty regions, yet the file system still reports the full length as the file’s size. Creating a new sparse file is easy using dd.

$ dd of=sparse_file bs=1024 count=0 seek=100K
0+0 records in
0+0 records out
0 bytes (0 B) copied, 2.72e-05 s, 0.0 kB/s

Notice that dd reports 0 records in and 0 records out: nothing was read or written. Nonetheless, a 100 MB file (seek=100K blocks of 1024 bytes each, or 104857600 bytes) should now exist in the current directory.

$ ls -l sparse_file
-rw-r--r-- 1 kyle kyle 104857600 2010-07-29 10:15 sparse_file
$ du -s -B1 --apparent-size sparse_file
104857600    sparse_file

Of course, it’s not really 100 megabytes…

$ du -s -B1 sparse_file
0    sparse_file
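The same comparison can be scripted. The following is a minimal sketch, assuming GNU coreutils (truncate and stat -c); the file name sparse_demo is made up for illustration. It creates a file of the same length as the dd example and prints the apparent size next to the space actually allocated, which depends on the underlying file system.

```shell
# Create a 100 MB sparse file; truncate only sets the length and writes no data.
truncate -s 100M sparse_demo

# The dd example seeks 100K blocks of 1024 bytes, which is the same length.
expected=$((100 * 1024 * 1024))

# %s = apparent size in bytes
# %b = number of allocated blocks, %B = size in bytes of each of those blocks
apparent=$(stat -c %s sparse_demo)
allocated=$(( $(stat -c %b sparse_demo) * $(stat -c %B sparse_demo) ))

echo "expected=$expected apparent=$apparent allocated=$allocated"
```

On a file system that supports sparse files, allocated should be far smaller than apparent.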

Files can be device files

In this example, we created a sparse file (sparse_file), which we will now assign to a loop device. A sparse file isn’t required; any regular file will do. First take a look at the current loop assignments on the system:

$ sudo losetup -a

If there are no loop assignments, the above command will not display any output. Display the first unused loop device with the following:

$ sudo losetup -f
/dev/loop0

The loop device files are under /dev just like the disk device files.

$ ls -l /dev/loop*
brw-rw---- 1 root disk 7, 0 2010-07-29 08:01 /dev/loop0
brw-rw---- 1 root disk 7, 1 2010-07-29 08:01 /dev/loop1
brw-rw---- 1 root disk 7, 2 2010-07-29 08:01 /dev/loop2
...

Your system may have up to 255 of these loop devices.

Creating a file system

Assign the sparse_file to the first available loop device. /dev/loop0 is the first available, so it becomes our device.

$ sudo losetup -f sparse_file
$ sudo losetup -a
/dev/loop0: [fc00]:4770 (/home/kyle/sparse_file)
$ sudo losetup -j sparse_file
/dev/loop0: [fc00]:4770 (/home/kyle/sparse_file)
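If you script this step, it helps to capture the device name rather than assume it is /dev/loop0. Here is a small sketch, assuming a util-linux losetup that supports --show; the helper names attach_loop and detach_loop are invented for this example.

```shell
# attach_loop FILE: attach FILE to the first free loop device
# and print the name of the device that was chosen.
attach_loop() {
    sudo losetup -f --show "$1"
}

# detach_loop DEVICE: release a loop device once nothing on it is mounted.
detach_loop() {
    sudo losetup -d "$1"
}
```

A typical use would be dev=$(attach_loop sparse_file), followed later by detach_loop "$dev" when the device is no longer needed.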

Create a file system on the loop device. In this example I create an XFS file system, but you could create whatever you want (ext4, jfs, reiserfs, etc.).

$ sudo mkfs.xfs /dev/loop0
meta-data=/dev/loop0             isize=256    agcount=4, agsize=6400 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=25600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=1200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mount the device:

$ mkdir sparse_file_mount
$ sudo mount /dev/loop0 sparse_file_mount/
$ df sparse_file_mount/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/loop0               97600      4256     93344   5% /home/kyle/sparse_file_mount

Since the device was mounted as root, to create files in the new file system you need to either

  • Create the files as root, or
  • Change the ownership of the mount to your local user.

(This is left as an exercise for the reader.)

Using mdadm for software RAID between two files

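After creating two empty files (sparse or otherwise) named disk1 and disk2 and assigning each of them to a loop device, we can use mdadm to establish a RAID configuration. It is not necessary to create a file system on the loop devices as we did before; we’ll do that on the new device created by mdadm. The commands below assume that /dev/loop0 and /dev/loop1 are free. If sparse_file from the previous section is still attached, unmount it and detach it with losetup -d /dev/loop0 first, or substitute the devices that losetup -a reports.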

$ dd of=disk1 bs=1024 count=0 seek=100K
-output removed-
$ dd of=disk2 bs=1024 count=0 seek=100K
-output removed-
$ ls -l disk*
-rw-r--r-- 1 kyle kyle 104857600 2010-07-29 11:03 disk1
-rw-r--r-- 1 kyle kyle 104857600 2010-07-29 11:21 disk2
$ sudo losetup -f disk1
$ sudo losetup -f disk2
$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop{0,1}
mdadm: array /dev/md0 started.
$ sudo mkfs.xfs -f /dev/md0
-output removed-
$ mkdir raid_mount
$ sudo mount /dev/md0 raid_mount/
$ df raid_mount/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0                 97536      4256     93280   5% /home/kyle/raid_mount

(Note: in bash, /dev/loop{0,1} is expanded to "/dev/loop0 /dev/loop1".)
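When you are done experimenting, the array has to be taken apart in the reverse order it was built. The sketch below is one way to do that, assuming the mount point, md device and loop devices from the example above; the function name teardown_raid is hypothetical. Before tearing anything down, cat /proc/mdstat shows whether the mirror is still syncing.

```shell
# teardown_raid MOUNTPOINT MD_DEVICE LOOP_DEVICE...
# Unmount the file system, stop the array, then free each loop device.
teardown_raid() {
    mount_dir=$1
    md_dev=$2
    shift 2

    sudo umount "$mount_dir" || return 1
    sudo mdadm --stop "$md_dev" || return 1

    # The remaining arguments are the loop devices backing the array.
    for loop_dev in "$@"; do
        sudo losetup -d "$loop_dev" || return 1
    done
}
```

For the array built above, that would be teardown_raid raid_mount /dev/md0 /dev/loop0 /dev/loop1. The disk1 and disk2 files are left in place.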

Networked RAID

At this point, setting up the networked RAID is a simple matter of mounting a remote file system and placing one (or more) of the disk files on it before assigning that file to a loop device. The remote mount can use a variety of protocols, such as NFS, SSHFS or Samba.
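To make that concrete, here is a rough sketch of the networked variant using NFS. Everything site-specific is an assumption: the export name, the mount point, the use of /dev/loop0 and /dev/loop1 (both must be free), and an export that root on the client is allowed to write to. The function name setup_network_raid is invented for this example. The --write-mostly flag asks mdadm to prefer the local disk for reads; it is optional.

```shell
# setup_network_raid EXPORT MOUNTPOINT LOCAL_FILE
# Mirror LOCAL_FILE with a sparse file that lives on an NFS share.
setup_network_raid() {
    export_path=$1     # e.g. fileserver:/export/raid (placeholder)
    remote_mount=$2    # e.g. /mnt/remote (placeholder)
    local_file=$3      # e.g. disk1 from the previous section

    sudo mkdir -p "$remote_mount" || return 1
    sudo mount -t nfs "$export_path" "$remote_mount" || return 1

    # Create the 100 MB sparse backing file on the remote share.
    sudo dd of="$remote_mount/disk2" bs=1024 count=0 seek=100K || return 1

    # Assumes loop0 and loop1 are unused; check with "losetup -a" first.
    sudo losetup /dev/loop0 "$local_file" || return 1
    sudo losetup /dev/loop1 "$remote_mount/disk2" || return 1

    # Devices listed after --write-mostly are avoided for reads when possible.
    sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/loop0 --write-mostly /dev/loop1
}
```

You would call it as setup_network_raid fileserver:/export/raid /mnt/remote disk1, then create a file system on /dev/md0 and mount it exactly as in the local example.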

Notes on Performance

You can use hdparm to gauge disk performance:

$ sudo hdparm -Tt /dev/md0
 
/dev/md0:
 Timing cached reads:   1772 MB in  2.00 seconds = 886.50 MB/sec
 Timing buffered disk reads:   98 MB in  1.30 seconds =  75.31 MB/sec
$ sudo hdparm -Tt /dev/sda2
 
/dev/sda2:
 Timing cached reads:   1726 MB in  2.00 seconds = 862.92 MB/sec
 Timing buffered disk reads:  196 MB in  3.00 seconds =  65.26 MB/sec

If the RAID is set up over a network, the bottleneck is likely to be bandwidth. On a 100 Mbps link, expect disk reads and writes of about 10 MB/s (the theoretical maximum is 100 Mbps / 8 bits per byte = 12.5 MB/s).
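The arithmetic is easy to reuse for other link speeds. This is a small sketch using awk for the floating-point division; the function names are made up, and the results are upper bounds that ignore protocol overhead.

```shell
# link_max_mb_per_s MBPS: theoretical ceiling in MB/s for a link of MBPS megabits/s.
link_max_mb_per_s() {
    awk -v mbps="$1" 'BEGIN { printf "%.1f\n", mbps / 8 }'
}

# transfer_seconds SIZE_MB RATE_MB_PER_S: time to move SIZE_MB at a sustained rate.
transfer_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", size / rate }'
}

link_max_mb_per_s 100      # prints 12.5
transfer_seconds 100 10    # prints 10
```

The second call estimates a full copy of one of the 100 MB disks at the 10 MB/s figure above, which is roughly what an initial sync of the mirror over such a link would take.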

Kyle Gibson (6 Posts)

3 Comments

  1. N
    Posted August 3, 2010 at 1:23 pm

    Check out DRBD for a networked RAID.

  2. Derek
    Posted August 3, 2010 at 1:23 pm

    This is a nice trick, I’ve been doing something similar on a faraway colo machine since it lost a disk. Replaced the failed /dev/sda* with loop devices on NFS-mounted files on a nearby machine. Indeed I get ~10MB/sec performance. I haven’t bothered with the “write-mostly” option as it’s just a low-volume mail/web server. I hadn’t known about using a sparse file – wish I had.

    The issue today is that I’m resyncing my 50GB /home to the NFS/loop device and the machine is on its knees, ~100 load average (not CPU – all WIO), unresponsive to the point of being offline, etc. Yet the I/O to the loop file is crawling at 500KB/sec. In the past, resync has been hard on load average, but at least was short-lived at 10MB/sec. At this rate, I’m looking at ~24 hours of downtime.

  3. poe84it
    Posted August 3, 2010 at 1:23 pm

    Check out vblade and AoE (ATA over Ethernet) techniques for a networked block device mapping with no TCP overhead.