File Systems O'Plenty Part Three: Object Based Storage

Quick Summary

Because this article is so long, I wanted to include tables summarizing the features of each file system.



Distributed File Systems
NFS/NAS
  Networking:
  • TCP, UDP
  • NFS/RDMA (InfiniBand) coming very soon
  Features:
  • Easy to configure and manage
  • Well understood (easy to debug)
  • Client comes with every version of Linux
  • Can be cost effective
  • Provides enough IO for many applications
  • May be enough capacity for your needs
  Limitations:
  • Single connection to the network
  • GigE throughput is about 100 MB/s max.
  • Limited aggregate performance
  • Limited capacity scalability
  • May not provide enough capacity
  • Potential load imbalance (if using multiple NAS devices)
  • "Islands" of storage are created if you use multiple NAS devices

Clustered NAS
  Networking:
  • Currently TCP only (almost entirely GigE)
  Features:
  • Usually a more scalable file system than other NAS models
  • Only one file server is used for the data flow (a forwarding model could potentially use all of the file servers)
  • Uses NFS as the protocol between client and file server (gateway)
  • Many applications don't need large amounts of IO for good performance (can use a low gateway/client ratio)
  Limitations:
  • Can have scalability problems (block allocation and write traffic)
  • Load-balancing problems
  • Need a high gateway/client ratio for good performance

AFS
  Networking:
  • Currently UDP only (primarily GigE)
  Features:
  • Caching (clients cache data, and servers can go down without loss of access to data)
  • Security (Kerberos and ACLs)
  • Scalability (additional servers simply increase the size of the file system)
  Limitations:
  • Limited single-client performance (only as fast as data access inside an individual node)
  • Not in widespread use
  • Uses UDP
  Example Vendors: Open-source (link)

iSCSI
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Allows for extremely flexible configurations
  • Software (target and initiator) comes with Linux
  • Centralized storage (easier administration and maintenance)
  • You don't have to use just SCSI drives
  Limitations:
  • Performance is not always as fast as it could be
  • Requires careful planning (not a limitation, just a requirement)
  • Centralized storage (if the centralized storage goes down, all clients go down)
  Example Vendors: Open-source (link)

HyperSCSI
  Networking:
  • Ethernet only (it uses its own packets rather than IP)
  Features:
  • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
  • Allows for very flexible configurations
  Limitations:
  • Hasn't been updated in a while
  • Packets cannot be routed since they aren't UDP or TCP
  Example Vendors: Open-source (link)

AoE
  Networking:
  • Ethernet only (it uses its own packets rather than IP)
  Features:
  • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
  • Drivers are part of the Linux kernel
  Limitations:
  • Uses the ATA protocol (really a requirement and not a limitation)
  • Packets cannot be routed since they aren't UDP or TCP
  Example Vendors: Open-source (link)

dCache
  Networking:
  • Currently TCP
  Features:
  • Can use disk space on all available machines (even clients)
  • Tertiary Storage Manager (HSM)
  Limitations:
  • Performance (it's only as fast as the local storage)
  • Limited use (primarily high-energy physics labs)
  Example Vendors: Open-source (link)
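The ~100 MB/s GigE ceiling cited in the NFS/NAS limitations above comes from simple link arithmetic. A quick sketch (the 94% payload-efficiency figure is an assumption for standard 1500-byte frames once Ethernet, IP, and TCP headers are subtracted):

```python
# Back-of-the-envelope throughput ceiling for NFS over gigabit Ethernet.
LINK_BITS_PER_SEC = 1_000_000_000      # GigE line rate
# Assumed protocol efficiency: Ethernet + IP + TCP headers consume roughly
# 5-6% of each standard 1500-byte frame, leaving ~94% for payload.
PAYLOAD_EFFICIENCY = 0.94

wire_mb_per_sec = LINK_BITS_PER_SEC / 8 / 1_000_000   # 125 MB/s raw
payload_mb_per_sec = wire_mb_per_sec * PAYLOAD_EFFICIENCY

print(f"raw line rate : {wire_mb_per_sec:.0f} MB/s")
print(f"usable payload: {payload_mb_per_sec:.1f} MB/s")
# Real NFS clients typically see closer to 100 MB/s once RPC overhead,
# latency, and server disk speed are factored in.
```

This is why a single GigE-attached NAS head caps aggregate performance no matter how fast its disks are.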


Parallel File Systems
GPFS
  Networking:
  • Currently TCP only
  • Native IB soon (4.x)
  Features:
  • Probably the most mature of all parallel file systems
  • Variable block sizes (up to 2 MB)
  • 32 sub-blocks (can help with small files)
  • Multi-cluster
  • IO pattern recognition
  • Can be configured with fail-over
  • NFS and CIFS gateways
  • Open portability layer (makes kernel updates easier)
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Pricing is by server and client (i.e., you have to pay for every client and server)
  • Block-based (has to use a sophisticated lock manager)
  • Can't change the block size after deployment
  • Current access is via TCP only (but IB is coming in version 4.x)
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: IBM

Rapidscale
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Uses standard Linux tools (md, lvm)
  • Distributed metadata
  • Good load balancing
  • NFS and CIFS gateways
  • High availability
  Limitations:
  • More difficult to expand capacity while load balancing
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Modified file system, modified iSCSI
  • Current network protocol is TCP (limits performance)
  • Must use Rackable hardware
  Vendors: Rackable

IBRIX
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Can split files and directories across several servers
  • Can split a directory across segment servers (good for directories that have lots of IO and lots of files)
  • Segment ownership can be migrated from one server to another
  • Segments can be taken off-line for maintenance without bringing the entire file system down
  • Can configure HA for segment fail-over
  • Snapshot tool
  • File replication tool
  • File system only solution (you can select whatever hardware you want)
  • Distributed metadata
  • NFS and CIFS gateways
  Limitations:
  • Administration load can be higher than with other file systems (some of this is due to the flexibility of the product)
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Native access is currently TCP only (limits performance)
  • File system only solution (it allows people to select unreliable hardware)
  • Rumors of having to pay for each client as well as for segment servers (data servers)
  Vendors: IBRIX

GlusterFS
  Networking:
  • TCP
  • InfiniBand
  Features:
  • Open-source
  • Excellent performance
  • Can use almost any hardware
  • Plug-ins (translators) provide a huge amount of flexibility and tuning capability
  • File system only solution (you can select whatever hardware you want)
  • No metadata server
  • Automated File Recovery (AFR) and auto-healing if a data server is lost
  • NFS and CIFS gateways
  Limitations:
  • Relatively new
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • File system only solution (it allows people to select unreliable hardware)
  • Extremely flexible (it takes some time to configure the file system the way you want it)
  Vendors: Open-source (link)

EMC Highroad (MPFSi)
  Networking:
  • TCP
  • Fibre Channel
  • Uses iSCSI as the data protocol
  Features:
  • NFS and CIFS gateways
  • Uses EMC storage, so backups may be easier
  Limitations:
  • Only EMC hardware can be used
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Single metadata server
  • The FC protocol requires an FC HBA in each node and an FC network ($$)
  • Most popular deployments use TCP (limits performance)
  Vendors: EMC

SGI CXFS
  Networking:
  • TCP (metadata) and FC (data)
  Features:
  • Multiple metadata servers (although only one is active)
  • Lots of redundancy in the design (recovery from data server failure)
  • Guaranteed IO rate
  • NFS and CIFS gateways(?)
  Limitations:
  • Doesn't scale well on clusters with many nodes
  • The FC protocol requires an FC HBA in each node and an FC network ($$)
  • Only one active metadata server
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Restricted to SGI-only hardware
  Vendors: SGI

Red Hat GFS
  Networking:
  • Fibre Channel (FC)
  • TCP (iSCSI)
  Features:
  • Open-source
  • Global locking
  • Can use almost any hardware for storage
  • Quotas
  • NFS and CIFS gateways
  Limitations:
  • Limited expandability (but the limit is large)
  • Dependent on RAID groups of disks for resiliency and reconstruction
  Vendors: Open-source (link)
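The GPFS entry above lists variable block sizes (up to 2 MB) and 32 sub-blocks per block; the practical effect is on the minimum space a small file consumes. A small illustrative sketch (the block sizes chosen are examples, not GPFS defaults):

```python
def min_allocation_kb(block_size_kb: float, subblocks_per_block: int = 32) -> float:
    """Smallest unit a file system with sub-block allocation hands out.

    With sub-block allocation, a tiny file occupies one sub-block
    rather than a whole block, which is why sub-blocks help small files.
    """
    return block_size_kb / subblocks_per_block

# With a 1 MB block, a 1 KB file occupies one 32 KB sub-block
# instead of a full 1 MB block.
for block_kb in (256, 1024, 2048):   # 2048 KB = the 2 MB maximum noted above
    print(f"{block_kb:>5} KB block -> "
          f"{min_allocation_kb(block_kb):.0f} KB minimum allocation")
```

This is also why the "can't change block size after deployment" limitation matters: the block size fixes the small-file allocation granularity for the life of the file system.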


Object Based File Systems/Storage
Panasas
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Object-based file system
  • Easy to set up, manage, and expand
  • Performance scales with shelves
  • Distributed metadata
  • Metadata fail-over
  • Fast reconstruction in the event of a disk failure
  • Disk sector scrubbers (look for bad sectors)
  • Can restore a sector if it is marked bad
  • Network parity
  • Blade drain
  • NFS and CIFS gateways (scalable NAS)
  Limitations:
  • Coupled hardware/software solution (more like an appliance; limits hardware choice)
  • Have to use Panasas hardware
  • Limited small-file performance
  • Kernel modules for kernel upgrades come from Panasas
  • Single-client performance is limited by the network (TCP)
  Vendors: Panasas

Lustre
  Networking:
  • TCP
  • Quadrics Elan
  • Myrinet GM and MX
  • InfiniBand (Mellanox, Voltaire, Infinicon, OFED)
  • RapidArray (Cray XD1)
  • Scali SDP
  • LNET (Lustre Networking)
  Features:
  • Open-source
  • Object-based file system
  • Can use a wide range of networking protocols
  • Can use native IB protocols for much higher performance
  • Excellent performance with a high-speed network
  • NFS and CIFS gateways (scalable NAS)
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Single metadata server
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: Lustre

PVFS
  Networking:
  • TCP
  • Myrinet (GM and MX)
  • Native IB protocols
  • Quadrics Elan
  Features:
  • Object-based file system
  • Easy to set up
  • Distributed metadata
  • Open-source
  • High-speed performance
  • Can use multiple networks
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Lacks some of the resiliency of other file systems (but wasn't designed for that functionality)
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: PVFS
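Several entries above depend on parity, whether computed in RAID groups or, as in the Panasas "network parity" feature, across objects on the network. The core idea is the same: XOR the data chunks to produce a parity chunk, and any single lost chunk can be rebuilt from the survivors. A minimal sketch of that principle (not any vendor's actual implementation):

```python
from functools import reduce

def parity(chunks):
    """XOR equal-length data chunks byte-by-byte to form the parity chunk."""
    return bytes(reduce(lambda a, b: a ^ b, byte_group)
                 for byte_group in zip(*chunks))

def reconstruct(surviving_chunks, parity_chunk):
    """Rebuild a single lost chunk: XOR the survivors with the parity.

    Works because x ^ x = 0, so all surviving chunks cancel out of the
    parity, leaving exactly the missing chunk.
    """
    return parity(surviving_chunks + [parity_chunk])

data = [b"obj-chunk-0", b"obj-chunk-1", b"obj-chunk-2"]
p = parity(data)

lost = data.pop(1)                  # simulate losing one object/disk
recovered = reconstruct(data, p)
assert recovered == lost
print("reconstructed:", recovered)  # prints: reconstructed: b'obj-chunk-1'
```

Computing parity on the client before data goes over the wire (as object-based systems can) is what distinguishes "network parity" from parity computed inside a RAID controller.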


I want to thank Marc Unangst, Brent Welch, and Garth Gibson of Panasas for their help in understanding the complex world of cluster file systems. While I haven't come close to the understanding they have, I know much more than when I started. This article, an attempt to summarize the world of cluster file systems, is the result of many discussions in which they answered many, many questions from me. I want to thank them for their help and their patience.

I also hope this series of articles, despite their length, has given you some good general information about file systems and even storage hardware. And to borrow some parting comments, "Be well, Do Good Work, and Stay in Touch."

A much shorter version of this article was originally published in ClusterWorld Magazine. It has been greatly updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).

© Copyright 2008, Jeffrey B. Layton. All rights reserved.
This article is copyrighted by Jeffrey B. Layton. Permission to use any part of the article or the entire article must be obtained in writing from Jeffrey B. Layton.


