Hardware failure is a fact of life in the storage infrastructure. To ward off the risks of sudden failure--in which something that was working an instant ago abruptly stops functioning--companies invest heavily in redundant systems, failover capabilities, and backup and restore technologies.

In many organizations, the tape drive is the last line of defense in the storage infrastructure. Administrators back up and archive data on tape so that in the case of sudden failure they can retrieve the data from the tapes. But what if the tape drive fails? Discovering a tape or tape drive failure is typically an unpleasant surprise. Very few administrators have a last line of defense to back up their last line of defense. And yet, no storage administrator wants to be in the position of worrying whether a restore operation will actually work.

To address this danger, tape and tape drive manufacturers are incorporating predictive failure analysis technology in their equipment. Predictive failure analysis alerts administrators when equipment is in danger of failure. In theory, technicians can then replace suspect tapes and drives before they fail, thereby minimizing interruptions to other ongoing processes.

Predictive failure analysis is particularly compelling to tape drive manufacturers. Unlike hard disks, tape technology has two components: the drive and the media. Either component can fail. Some tape drive manufacturers report that as many as 50 percent of the drives that users return as malfunctioning are actually working correctly--the problems lie in the media. Moreover, because so many technologies are involved in backup and archive operations, pinpointing the exact source of a failure is difficult even with advanced diagnostics. If an enterprise doesn't employ specialists in the technology, the problem becomes even more difficult.

The basis for predictive failure analysis in the tape arena is the medium auxiliary memory (MAM) standard. Developed under the auspices of the T10 standards body--which includes representatives from Sony, with its AIT technology, and IBM and HP, which have promoted the alternative Linear Tape-Open (LTO) technology--MAM defines how vendors store status information on a tape cartridge's chip. MAM's primary purpose is to speed internal drive operations. Additional space on the chip is set aside to record medium-identification and usage information.

MAM has several interesting characteristics. First, MAM has its own dedicated reader, so a tape drive doesn't necessarily have to read the information on the tape cartridge's chip. Second, by design, MAM bypasses the server's primary memory, so it doesn't add overhead to the overall system.

Typically, drives equipped with MAM capture a wide array of data, including the number of read/write errors, the amount of data read from or written to the tape, the time the tape was last used, and associated identification information. Until now, LTO and AIT drive manufacturers' products have read (not captured) that data, then used certain statistical algorithms to assess when a failure is likely to occur.
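To make the idea concrete, the following Python sketch shows one way usage statistics of this kind could feed a simple failure heuristic. It is an illustration only: the CartridgeStats record, its field names, and the threshold of 5 errors per gigabyte are assumptions made for the example, not the MAM attribute layout or any vendor's actual algorithm.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CartridgeStats:
    """Hypothetical snapshot of the kinds of statistics described above."""
    cartridge_id: str
    read_errors: int
    write_errors: int
    bytes_read: int
    bytes_written: int
    last_used: datetime


def errors_per_gb(stats: CartridgeStats) -> float:
    """Return total read/write errors per gigabyte transferred."""
    gigabytes = (stats.bytes_read + stats.bytes_written) / 1e9
    if gigabytes == 0:
        return 0.0  # no traffic yet, so there is no rate to report
    return (stats.read_errors + stats.write_errors) / gigabytes


def is_suspect(stats: CartridgeStats, threshold: float = 5.0) -> bool:
    """Flag a cartridge whose error rate exceeds an illustrative threshold."""
    return errors_per_gb(stats) > threshold


# Example values are invented for illustration.
tape = CartridgeStats(
    cartridge_id="TAPE-0001",
    read_errors=120,
    write_errors=80,
    bytes_read=10 * 10**9,
    bytes_written=10 * 10**9,
    last_used=datetime(2000, 1, 1),
)
print(errors_per_gb(tape))  # 200 errors over 20 GB -> 10.0
print(is_suspect(tape))     # True with the default threshold
```

A real product would presumably track trends over time rather than apply a single fixed cutoff, but the sketch shows why the raw counters matter: without the volume of data transferred, an error count alone says little about the health of a tape.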

Quantum, a company that champions DLT--which has been the king of the hill in the tape backup and archiving sector--has recently raised the bar in the use of MAM statistics for predictive failure technology. In July, Quantum released its DLTSage technology, which goes beyond the use of statistical algorithms to predict failure. DLTSage captures and records actual error rates in the media, as well as a range of other statistics. Based on this data, which it draws from ongoing operations, the technology can provide a more accurate picture of a tape drive's vulnerability to failure. DLTSage can also give better insight into where a problem is developing--in a specific cartridge or in a specific drive.
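The value of recorded error data for telling a bad cartridge from a bad drive can be illustrated with a short Python sketch. This is not how DLTSage works internally, which the article does not describe; it is a minimal heuristic under stated assumptions. The session log format, the localize_faults function, and both thresholds are hypothetical. The idea is that a cartridge that produces high error counts in several different drives is probably at fault, while a drive that produces high error counts with several different cartridges is the likelier culprit.

```python
from collections import defaultdict


def localize_faults(sessions, error_threshold=10, min_partners=2):
    """Return (suspect_drives, suspect_cartridges) from a session log.

    sessions: iterable of (drive_id, cartridge_id, error_count) tuples,
    a hypothetical log format chosen for this example.
    """
    bad_drives_for_cartridge = defaultdict(set)
    bad_cartridges_for_drive = defaultdict(set)

    for drive_id, cartridge_id, errors in sessions:
        if errors > error_threshold:
            # Remember which partner was involved in each high-error session.
            bad_drives_for_cartridge[cartridge_id].add(drive_id)
            bad_cartridges_for_drive[drive_id].add(cartridge_id)

    # A component is suspect only if it misbehaved with several partners.
    suspect_cartridges = sorted(
        c for c, drives in bad_drives_for_cartridge.items()
        if len(drives) >= min_partners
    )
    suspect_drives = sorted(
        d for d, carts in bad_cartridges_for_drive.items()
        if len(carts) >= min_partners
    )
    return suspect_drives, suspect_cartridges


# Invented sample log: cartridge C1 fails in two drives, drive D3 fails
# with two cartridges, and every other session is clean.
sessions = [
    ("D1", "C1", 50),
    ("D2", "C1", 40),
    ("D1", "C2", 1),
    ("D2", "C3", 0),
    ("D3", "C4", 30),
    ("D3", "C5", 25),
    ("D1", "C4", 2),
]
print(localize_faults(sessions))  # (['D3'], ['C1'])
```

In the sample log, drives D1 and D2 each show errors only with cartridge C1, so neither is flagged, which mirrors the situation in which a working drive is returned when the media is actually to blame.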

The emergence of technologies such as DLTSage signals two important shifts in the marketplace. First, it announces the adoption of the predictive-diagnostics concept throughout the technology stack. When downtime isn't an option, companies can't simply prepare to respond vigorously to system crashes--they must head them off. Second, an interesting ancillary benefit of increased data processing and storage muscle is the ability to cost-effectively capture and store statistics about hardware and software operations. Over time, the marketplace will want vendors to document the performance of their products in a way that facilitates more effective management.