IBM Spectrum Scale is high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes. It is used by many of the world's largest commercial companies, as well as by some of the supercomputers on the Top 500 List. For example, it was the filesystem of the ASC Purple supercomputer, which was composed of more than 12,000 processors and had 2 petabytes of total disk storage spanning more than 11,000 disks. Before 2015, Spectrum Scale was known as IBM General Parallel File System (GPFS). Like typical cluster filesystems, Spectrum Scale provides concurrent high-speed file access to applications executing on multiple nodes of a cluster. It can be used with AIX 5L clusters, Linux clusters, Microsoft Windows Server, or a heterogeneous cluster of AIX, Linux and Windows nodes. In addition to providing filesystem storage capabilities, Spectrum Scale provides tools for management and administration of the Spectrum Scale cluster and allows shared access to file systems from remote Spectrum Scale clusters. Spectrum Scale has been available on IBM's AIX since 1998, on Linux since 2001, and on Windows Server since 2008.
Spectrum Scale, then known as GPFS, began as the Tiger Shark file system, a research project at IBM's Almaden Research Center as early as 1993. Tiger Shark was initially designed to support high throughput multimedia applications. This design turned out to be well suited to scientific computing.
Another ancestor of Spectrum Scale is IBM's Vesta filesystem, developed as a research project at IBM's Thomas J. Watson Research Center between 1992 and 1995. Vesta introduced the concept of file partitioning to accommodate the needs of parallel applications that run on high-performance multicomputers with parallel I/O subsystems. With partitioning, a file is not a sequence of bytes, but rather multiple disjoint sequences that may be accessed in parallel. The partitioning is such that it abstracts away the number and type of I/O nodes hosting the filesystem, and it allows a variety of logically partitioned views of files, regardless of the physical distribution of data within the I/O nodes. The disjoint sequences are arranged to correspond to individual processes of a parallel application, allowing for improved scalability.
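The idea of a partitioned file can be illustrated with a short Python sketch. It is not Vesta's actual interface: the function name `partition_view` and the choice of dealing out fixed-size records round-robin are assumptions made purely for illustration. The sketch shows how one byte sequence can be presented as several disjoint subsequences, one per process, independently of how the data is physically stored.

```python
from typing import List


def partition_view(data: bytes, num_parts: int, record_size: int) -> List[bytes]:
    """Return num_parts disjoint subsequences of data.

    Hypothetical helper: fixed-size records are dealt out round-robin,
    so record k belongs to partition k % num_parts.
    """
    if num_parts <= 0 or record_size <= 0:
        raise ValueError("num_parts and record_size must be positive")
    parts = [bytearray() for _ in range(num_parts)]
    for offset in range(0, len(data), record_size):
        record_index = offset // record_size
        parts[record_index % num_parts] += data[offset:offset + record_size]
    return [bytes(p) for p in parts]


# The same file viewed by two processes and by three processes.
data = b"AAAABBBBCCCCDDDDEEEEFFFF"
two_way = partition_view(data, num_parts=2, record_size=4)
three_way = partition_view(data, num_parts=3, record_size=4)
```

Here `two_way` and `three_way` are different logical views of the same data; each process of a parallel application would read only its own subsequence.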
Vesta was commercialized as the PIOFS filesystem around 1994, and was succeeded by GPFS around 1998. The main difference between the older and newer filesystems was that GPFS replaced the specialized interface offered by Vesta/PIOFS with the standard Unix API: all the features to support high performance parallel I/O were hidden from users and implemented under the hood.
Spectrum Scale was offered as part of the IBM System Cluster 1350.
Spectrum Scale is used by many of the supercomputers on the Top 500 list. Since its inception, it has been deployed for many commercial applications, including digital media, grid analytics, and scalable file services.
In 2010, IBM previewed a version of GPFS that included a capability known as GPFS-SNC, where SNC stands for Shared Nothing Cluster. This was officially released with GPFS 3.5 in December 2012, and is now known as FPO (File Placement Optimizer). It allows Spectrum Scale to use locally attached disks on a cluster of network-connected servers rather than requiring dedicated servers with shared disks (e.g. using a SAN). FPO is suitable for workloads with high data locality, such as shared-nothing database clusters like SAP HANA and DB2 DPF, and can be used as an HDFS-compatible filesystem.
Features of Spectrum Scale file systems include high availability, the ability to be used in a heterogeneous cluster, disaster recovery, security, the Data Management API (DMAPI), hierarchical storage management (HSM) and information lifecycle management (ILM).
Spectrum Scale is a clustered file system. It breaks a file into blocks of a configured size, less than 1 megabyte each, which are distributed across multiple cluster nodes.
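A minimal Python sketch of block striping follows. It assumes simple round-robin placement, which is an illustrative simplification and not a description of Spectrum Scale's actual allocation algorithm; the function name `stripe_blocks`, the node names and the 256 KiB block size are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def stripe_blocks(file_size: int, block_size: int, nodes: List[str]) -> Dict[str, List[int]]:
    """Map block indexes of a file to nodes, round-robin (illustrative only)."""
    if block_size <= 0 or not nodes:
        raise ValueError("block_size must be positive and nodes non-empty")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement: Dict[str, List[int]] = defaultdict(list)
    for block in range(num_blocks):
        placement[nodes[block % len(nodes)]].append(block)
    return dict(placement)


# A 1,000,000-byte file with an assumed 256 KiB block size needs 4 blocks.
layout = stripe_blocks(
    file_size=1_000_000,
    block_size=256 * 1024,
    nodes=["node1", "node2", "node3"],
)
```

Because consecutive blocks land on different nodes, a large sequential read can be served by several nodes at once.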
The system stores data on standard block storage volumes, but includes an internal RAID layer (called Spectrum Scale RAID) that can virtualize those volumes for redundancy and parallel access, much like a RAID block storage system. It can also replicate across volumes at the higher file level.
Features of the architecture include
Hadoop's HDFS filesystem is designed to store similar or greater quantities of data on commodity hardware, that is, in datacenters without RAID disks and a storage area network (SAN). Compared to Spectrum Scale:
Storage pools allow for the grouping of disks within a file system. An administrator can create tiers of storage by grouping disks based on performance, locality or reliability characteristics. For example, one pool could be high-performance Fibre Channel disks and another more economical SATA storage.
A fileset is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. Filesets provide an administrative boundary that can be used to set quotas and be specified in a policy to control initial data placement or data migration. Data in a single fileset can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user defined policy.
There are two types of user-defined policies in Spectrum Scale: file placement and file management. File placement policies direct file data to the appropriate storage pool as files are created. File placement rules are selected by attributes such as the file name, the user name or the fileset. File management policies allow a file's data to be moved or replicated, or files to be deleted. They can be used to move data from one pool to another without changing the file's location in the directory structure. File management rules are selected by file attributes such as last access time, path name or file size.
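The two policy types can be modelled with a short Python sketch. This is not Spectrum Scale's policy rule syntax; every class, function, pool name and threshold below (`FileInfo`, `PlacementRule`, `MigrationRule`, `place`, `apply_management`, `"fast"`, `"economy"`, 30 days) is a hypothetical stand-in chosen to illustrate the behaviour described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FileInfo:
    path: str
    owner: str
    fileset: str
    size: int
    days_since_access: int
    pool: Optional[str] = None


@dataclass
class PlacementRule:
    pool: str
    matches: Callable[[FileInfo], bool]


@dataclass
class MigrationRule:
    from_pool: str
    to_pool: str
    matches: Callable[[FileInfo], bool]


def place(f: FileInfo, rules: List[PlacementRule], default_pool: str) -> str:
    """Pick the pool for a new file: the first matching rule wins."""
    for rule in rules:
        if rule.matches(f):
            return rule.pool
    return default_pool


def apply_management(files: List[FileInfo], rules: List[MigrationRule]) -> List[str]:
    """Move matching files between pools; the path is never changed."""
    moved = []
    for f in files:
        for rule in rules:
            if f.pool == rule.from_pool and rule.matches(f):
                f.pool = rule.to_pool
                moved.append(f.path)
                break
    return moved


placement = [PlacementRule("fast", lambda f: f.path.endswith(".db"))]
management = [MigrationRule("fast", "economy", lambda f: f.days_since_access > 30)]

files = [
    FileInfo("/fs/prod/orders.db", "alice", "prod", 10, 0),
    FileInfo("/fs/media/video.mp4", "bob", "media", 5000, 90),
    FileInfo("/fs/prod/old.db", "alice", "prod", 20, 45),
]
for f in files:
    f.pool = place(f, placement, default_pool="economy")
moved = apply_management(files, management)
```

After the management pass, `/fs/prod/old.db` resides in the `economy` pool but keeps the same path, which mirrors the point that migration does not change a file's location in the directory structure.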
The Spectrum Scale policy processing engine is scalable and can be run on many nodes at once. This allows management policies to be applied to a single file system with billions of files and complete in a few hours.