This post is obsolete

This blog post is quite old, and a lot has changed since then. Meanwhile, I have changed employer, blogging platform, software stack, infrastructure, interests, and more.
You will probably find more recent and relevant information about the topics discussed here elsewhere.
Still, this content is provided here for historical reasons, but please don’t expect it to be current or authoritative at this point.
Thanks for stopping by!


Oracle Solaris 10 09/10: ZFS Highlights


The recently announced Oracle Solaris 10 09/10 release (no link, page no longer exists) introduced a number of significant upgrades to the ZFS file system.

Ironically, Solaris 10 now comes with a higher ZFS pool version (at least 19; no link, opensolaris.org no longer exists) than OpenSolaris 2009.06 (version 14; no link, opensolaris.org no longer exists).
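
By the way, if you upgrade an existing system, your pools keep their old on-disk version until you upgrade them explicitly. Here’s a quick sketch (the pool name tank is just an example, and the exact version numbers you see will depend on your patch level):

  # List the ZFS pool versions this release supports
  zpool upgrade -v

  # Show which pools are still running an older on-disk version
  zpool upgrade

  # Upgrade a specific pool to the latest supported version
  zpool upgrade tank

Keep in mind that an upgraded pool can no longer be imported on systems running older ZFS versions.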

So let’s look at some of the key ZFS improvements that came in this update and figure out why they’re so useful.

In this article, you’ll learn more about LUN Expansion, Snapshot Holds, Triple-Parity RAID-Z, Log Device Improvements, Pool Recovery, and Splitting Mirrors, plus we’ll discover a new scheduler class!

And as a bonus, we’ll get to watch some videos that explain these features in further detail.

Online Drive Capacity Expansion

A frequent ZFS question is: What happens if the LUN size expands? Will ZFS expand the pool as well?

Previously, ZFS would check the LUN size only at each import (i.e. during boot time) and automatically adjust the pool size to match the size of the underlying LUNs.

But this had two drawbacks: It wasn’t very convenient (you’d need to export and re-import your pool to take advantage of the new size), and sometimes it wasn’t even desired (imagine accidentally increasing the size of a LUN and having ZFS gobble up the extra capacity; now there’s no way back other than migrating your data to a new, smaller pool).

Now, there’s a nice and clean system event, plus a new property to guide ZFS’ behavior when expanding LUNs:

  • When a drive expands its size, a system event is generated. ZFS catches this event and decides what to do about it while the pool stays online, without requiring an export/import or a reboot.

  • The new autoexpand pool property tells ZFS whether it should immediately take advantage of the new capacity, or wait for the admin to do so. To avoid any accidental expansions, this property is set to off by default.

  • The new zpool online -e command allows system administrators to trigger expansion of the pool after a LUN has been expanded. Now admins can decide by hand if and when to expand their pools after expanding the underlying LUNs (see the example below).
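
Here’s a quick sketch of how this looks in practice, assuming a hypothetical pool named tank on a LUN c0t1d0 that has just been grown on the storage side:

  # Option 1: let ZFS expand automatically whenever a LUN grows
  zpool set autoexpand=on tank

  # Option 2: leave autoexpand off and trigger the expansion by hand
  zpool online -e tank c0t1d0

  # Verify the new pool size
  zpool list tank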

Check out the current zpool(1M) (no link, sun.com no longer exists) man page for details.

ZFS Snapshot Holds

If you’re like me, then you’ll have a lot of snapshots. I currently have only about 3000 snapshots on my main data pool, because I created it from scratch about a year ago, but there are people out there with tens of thousands of snapshots, and they’re not the exception.

And if you have multiple administrators, or some kind of scripting system that deletes old snapshots, or both, then you run the risk of a snapshot being deleted right under your nose while you’re still trying to access it, for example during a zfs send/receive operation. Not good.

Introducing Snapshot Holds: A ZFS snapshot hold is simply a tag that you assign to a snapshot:

  zfs hold mytag tank/foo/bar@snap-20100913

You can add as many holds as you like to a snapshot, each with a different tag name. By using the -r flag, you can recursively add holds to all snapshots of all children of the specified filesystem.

Then, if a snapshot has any hold attached to it, it can’t be destroyed.

Thus, snapshot holds serve as a locking mechanism to prevent accidental destruction of snapshots that are still in use. Very useful.

After doing your thing, and before destroying a snapshot, you can remove a hold:

  zfs release mytag tank/foo/bar@snap-20100913

Again, there’s a -r option available.
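
To see which holds are attached to a snapshot, there’s also a zfs holds command. A short sketch, reusing the hypothetical snapshot from above:

  # List all holds on a snapshot
  zfs holds tank/foo/bar@snap-20100913

  # Add a hold recursively to this snapshot across all child file systems
  zfs hold -r mytag tank/foo/bar@snap-20100913

  # Destroying the snapshot fails as long as any hold exists
  zfs destroy tank/foo/bar@snap-20100913

  # Release the holds recursively, then the destroy will succeed
  zfs release -r mytag tank/foo/bar@snap-20100913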

Triple-Parity RAID-Z

Drive capacity is increasing quickly: Every 1-2 years, drives double their capacity at the same price.

Unfortunately, drive performance doesn’t increase that quickly, for reasons of pure physics: You can’t spin drives much faster than the standard 10-15K rpm without running into serious problems (of the exploding-drives category).

This means: If a drive fails in your RAID-Z set, reconstructing all that capacity takes a long time, so there’s a pretty good (or more accurately: bad) chance that a second drive will break while you’re still replacing the first one. That’s why we now have double-parity RAID-Z.

And since the discrepancy becomes worse every couple of years, it’s now time to introduce triple-parity RAID-Z: Up to three drives may fail at the same time before you start losing data.

This feature is implemented in a straightforward way:

  zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0
  zpool add tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0

And so on.

Watch a video of ZFS developer George Wilson explaining RAID-Z3 here:

Keep in mind, though, that using any form of RAID-Z doesn’t really help performance. Don’t be too greedy and make sure you balance capacity vs. performance in your setup according to your business needs. Here are some more tips on ZFS performance.

Log Device Removal And The Logbias Property

ZFS log devices are great, because they help you improve write performance for synchronous loads. Read everything you need to know about ZFS log devices here. Back? Ok, let’s move on.

Now there are two ZIL device improvements in Solaris 10 (remember: Log device = ZIL device = Writezilla):

  • The logbias property lets you determine whether an external log device is used for a particular file system: In latency mode (the default), ZFS uses the log device to minimize latency for synchronous writes. In throughput mode, it doesn’t use the log device for that file system and instead optimizes for maximum throughput. The latter may be the better setting for databases, where logs are often kept on separate storage anyway. Don’t worry: Any extra ZIL devices will still be used for ZFS file systems that have this property set to latency or not set at all.

  • Log devices can now be removed. What sounds trivial at first turns out to be slightly more difficult under the hood. Well, now it’s finally possible (see the example below).
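
Here’s a hedged sketch of both features; the tank/db file system and the device name are made up for illustration:

  # Optimize a database file system for throughput instead of log latency
  zfs set logbias=throughput tank/db

  # Add a dedicated log device to the pool ...
  zpool add tank log c3t0d0

  # ... and, new in this release, remove it again when it’s no longer needed
  zpool remove tank c3t0d0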

Again, George Wilson has more to share on this subject:

ZFS Pool Recovery

Even with ZFS, there’s the possibility that a pool may break in a way that it can’t be imported into the system. The zfs-discuss (no link, opensolaris.org no longer exists) mailing list is full of such cases.

This doesn’t mean that ZFS is bad, it simply means that some hardware, which doesn’t adhere to standards very well, may manage to bork a ZFS pool.

The particular vulnerability here is the write sequence: ZFS assumes that blocks are written to disk in exactly the same sequence as ZFS issued them. This is needed to accurately perform the copy-on-write algorithm that culminates in the final writing of the Uber-Block.

Now if the Uber-Block gets written before the blocks it depends on, and the power fails in between, there’s no way to access the rest of the pool starting from that Uber-Block.

Unfortunately, this is exactly what happens with cheaper storage systems, e.g. USB disks. They are all optimized for speed, and they don’t hesitate to sacrifice write sequence to achieve it.

Another, more enterprise-class example involves advanced storage software that performs data replication under the hood. Since the storage system doesn’t know which blocks perform what function in ZFS, it just goes ahead and copies raw blocks in any sequence it likes. If something goes wrong here, again we’re in trouble.

Fortunately, there’s a way out: ZFS keeps multiple Uber-Blocks, up to 128 (IIRC). So if the most recent one turns out to point to wrong data, we can try with a slightly older Uber-Block, then another one, and so on, until we hit one that can point us to a completely valid pool version.

It’s like rolling back through magic snapshots that have been created just before things went wrong!

Previously, only those very familiar with ZFS data structures could perform this trick. Now, there’s a function called ZFS storage pool recovery:

  zpool clear -F tank

or

  zpool import -F tank

will attempt to roll back through the Uber-Block history until they find one that points to a valid set of metadata blocks, and hence recover the pool to the latest known good state. This may cost you the last couple of transactions, but at least you’ll be able to continue working with the rest of the pool, as if time had been turned back.
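
If you’d rather see what a recovery would do before committing to it, zpool import also accepts -n together with -F for a dry run (the pool name is just an example):

  # Dry run: check whether the pool could be made importable, without changing it
  zpool import -nF tank

  # If that looks good, perform the actual recovery on import
  zpool import -F tank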

George has some more details to add:

Splitting Mirrors

Back in the old days, when the Solaris Volume Manager was still used, splitting a mirror was a very common way to perform backups, clone systems and do other useful stuff.

In ZFS, splitting a mirror wasn’t supported until now. Let’s hear what George has to say about ZFS pool split:

The details are covered in the zpool(1M) (no link, sun.com no longer exists) man page, as usual.
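
In its simplest form, assuming a mirrored pool named tank, splitting off one half into a new pool looks roughly like this:

  # Detach one side of each mirror and turn it into a new, exported pool
  zpool split tank tank-backup

  # The new pool can then be imported on this machine or on another system
  zpool import tank-backup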

ZFS System Processes and the New System Duty Cycle Scheduling Class

Servers with a lot of pools and certain ZFS features enabled (think compression, complex checksums, deduplication, or encryption in the future) can see quite a lot of CPU load due to ZFS transactions.

Previously, CPU usage by ZFS was hard to analyze because it was hidden inside the kernel. Heavy CPU usage by ZFS was also hard to balance against user processes. This could lead to situations in which the server grinds on its disks all the time while the user doesn’t see any reaction from the system. Just imagine what would happen if you scrubbed dozens of compressed, SHA-256-protected pools at once on a CPU-starved system.

The new version of ZFS in the current Oracle Solaris 10 09/10 release alleviates this situation in a clever way: Each zpool gets its own system process for handling all of the pool’s I/O. This lets you see exactly which pool is responsible for how much CPU utilization.

Also, there’s now a new scheduling class devoted to these kinds of tasks: The System Duty Cycle (no link, sun.com no longer exists) class has been specifically created to model system activity in a granular fashion, so the scheduler can better balance system vs. user utilization of the CPU.
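
Assuming a pool named tank, you should be able to watch these per-pool processes and their scheduling class with standard tools, along these lines:

  # Each pool now shows up as its own zpool-<poolname> system process
  ps -ef | grep zpool-tank

  # The -c option adds the scheduling class column; look for the SDC class
  ps -efc | grep zpool-

  # prstat shows how much CPU each pool's process is consuming
  prstat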

Solaris has a highly sophisticated scheduler, and I remember that during the CeBIT trade show days, there would always be a couple of students approaching the Sun booth, asking for details on Solaris scheduling because they were working on a paper or a master’s thesis on the topic. With the latest release of Oracle Solaris 10, the Solaris scheduler became even more sophisticated.

Check out an introduction to the Solaris Process Scheduler (no link, sun.com no longer exists), and a video of George explaining the new ZFS System Duty Cycle Scheduling Class below:

Conclusion

The Oracle Solaris 10 09/10 update is a very significant one for ZFS users, adding a wealth of new options and improvements to an already awesome file system. Make sure to check out the other improvements in the OS as well (no link, sun.com no longer exists) and update your Oracle Solaris 10 installation now to take advantage of them!

Your Take

Which of the above features do you look forward to most? Have you tried them out already? Feel free to share your Solaris 10 09/10 experience below in the comments!

Thanks to Deirdré for capturing George Wilson on video and sharing it on blogs.sun.com/video (no link, sun.com no longer exists)!

