PC Week Labs uncovers ways to recover NOSes when disaster strikes and explains how to prevent disasters in the first place

By Erick Von Schweber for PC Week Online

HAL 9000, the computer in "2001: A Space Odyssey," advised a crew member that all computer anomalies were ultimately attributable to human error. Although current statistics don't corroborate HAL's statement, they do show that human error can certainly make a bad situation worse.

Human error is the cause of disk- and tape-related failures in only 32 percent of all cases, according to Ontrack Data International Inc., a company that has tracked data loss and corruption for several years. However, although human error trailed electromechanical hardware failures as the primary cause of data loss in Ontrack's research, administrators and staff operating beyond their knowledge, tools and skills often make recovery efforts far more difficult or sometimes even impossible.

Taking minimum precautions can have maximum benefits if network data recovery ever becomes necessary. A data recovery operation is a last resort, however: It is far better to perform backups routinely, check that they are complete and accurate, and periodically perform a restore to ensure that there are no holes in the backup/restore scenario.
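One way to picture the "check and periodically restore" step is to restore a backup to a scratch location and compare it, file by file, with the live data. The Python sketch below does this with SHA-256 checksums. It is only an illustration under stated assumptions: the function names and the idea of comparing two directory trees are not from any particular backup product, and a real test restore would also need to account for files that changed after the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root: Path, restored_root: Path) -> dict:
    """Compare every file under source_root with its restored counterpart.

    Hypothetical helper: reports files missing from the restore, files whose
    contents differ, and a count of files that matched.
    """
    report = {"missing": [], "mismatched": [], "verified": 0}
    source_files = sorted(p for p in source_root.rglob("*") if p.is_file())
    for source_file in source_files:
        relative = source_file.relative_to(source_root)
        restored_file = restored_root / relative
        if not restored_file.is_file():
            report["missing"].append(relative.as_posix())
        elif file_digest(source_file) != file_digest(restored_file):
            report["mismatched"].append(relative.as_posix())
        else:
            report["verified"] += 1
    return report
```

A restore would count as clean only if both the "missing" and "mismatched" lists come back empty.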

PC Week Labs found that, although no NOS (network operating system) file system is impervious to corruption, Microsoft Corp.'s Windows NT file system and Novell Inc.'s NetWare have a small lead over Unix file systems in their resilience after an operator error.

RAID systems, on the other hand, provide an increase in average reliability at the expense of a far more complex data recovery operation, should it be necessary, and are far more susceptible to operator error. Users of Unix or RAID systems on any NOS should be sure their operators are well-trained and experienced.

A NOS file system is larger and more complex than the DOS-based file systems it displaces, and the data contained on the network is usually of greater value. Thus, a $39.95 DOS or Windows utility that might be useful for recovering a slide presentation from a laptop is simply inappropriate for use on a corporate network.

Common problems

All NOSes are susceptible to electromechanical failures. In these cases, it is essential to turn off the failed device as quickly as possible. In no case should disk scanning or repair software be used on a physically damaged device, because this can exacerbate corruption and reduce the likelihood of recovery.

Although less prevalent than hardware failure, buggy software and computer viruses such as Michelangelo or Stoned can corrupt a network drive's disk map. Disk map data can often be recovered, provided that overzealous, underskilled staff don't compound the problem by attempting to make repairs such as reinstalling the NOS, which can complete the job that the virus insidiously began.

Often, the greatest impediment to effective data recovery is the administrator who, feeling responsible for the data loss, haphazardly employs a NOS disk-scanning and repair utility. This can cause additional corruption and loss.

NetWare emergencies

Each NOS supplies its own version of recovery utilities. When disaster strikes, users of NetWare turn to VRepair, a disk recovery utility included with the NOS. VRepair is suitable for repairing many disk errors, but can cause additional corruption if required to correct thousands of drive errors at a time.

Administrators would be wise to run VRepair manually in "check only" mode first, logging errors to a file without attempting to make fixes. Provided that the errors are not too numerous, administrators can then run VRepair again to perform the fixes.

If a network volume becomes invisible or unmountable, network technicians often attempt to re-create the volume. This overwrites the FAT (file allocation table), making professional data recovery services mandatory when a knowledgeable administrator might have otherwise remedied the situation. The only files a recovery engineer can be reasonably sure of recovering are those of a size equal to or smaller than the block size selected when the volume was first created (64KB for NetWare 4.0 and higher, 4KB or 8KB for older releases).
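The block-size rule above comes down to simple arithmetic: a file that fits within a single block does not depend on the overwritten allocation table to link its pieces together. The Python sketch below applies that rule to a list of file sizes. The file names and sizes are hypothetical, and the result is only a rough indication of what a recovery engineer might reasonably expect to get back, not a guarantee.

```python
KB = 1024


def fits_in_one_block(file_size_bytes: int, block_size_bytes: int) -> bool:
    """True if a non-empty file occupies no more than one volume block."""
    return 0 < file_size_bytes <= block_size_bytes


def likely_recoverable(file_sizes: dict, block_size_bytes: int) -> dict:
    """Return the subset of files small enough to sit in a single block."""
    return {
        name: size
        for name, size in file_sizes.items()
        if fits_in_one_block(size, block_size_bytes)
    }


# Hypothetical inventory of a re-created volume.
inventory = {"memo.txt": 3 * KB, "report.doc": 48 * KB, "orders.dat": 200 * KB}

# With the 64KB block size cited for NetWare 4.0 and higher.
large_blocks = likely_recoverable(inventory, 64 * KB)

# With a 4KB block size, as cited for older releases.
small_blocks = likely_recoverable(inventory, 4 * KB)
```

Under these assumed sizes, the 64KB volume would leave two of the three files within reach, while the 4KB volume would leave only the smallest.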

NetWare managers should be aware of one additional peculiarity. Should an abend occur during the backup of a volume with a mirrored set, not only will the backup become corrupted, but so might the Volume Descriptor table, the Hotfix table and the mirror table. A safer procedure is to remove the mirror set for the backup and then regenerate it afterward.

Windows NT corruption

Administrators running Windows NT face similar prospects in terms of the types and causes of corruption and the utilities available to remedy them. By default, the installation of NT configures Check Disk to run automatically upon boot-up; at this time, it will scan the drives and attempt to make repairs if it discovers problems--without prompting the user or requesting confirmation.

Because Check Disk can make existing data corruption worse, system administrators should reconfigure NT to suppress the unconditional execution of Check Disk, which can instead be run manually as a "scan-only" utility.

Technicians who think that a case of disappearing data can be rectified by reinstalling NT should think again. Reinstalling the operating system will overwrite the Master File Table, magnifying the original problem. Although this is a likely candidate for successful professional data recovery, it is easily avoided. Under NT, even a high-level disk reformat is recoverable. All versions of NT scatter file records across the disk, so professionals can search for the records and rebuild the Master File Table.

The same cannot be said for recovering deleted files. Files can likely be undeleted from a nonfragmented NT 3.51 disk, but they are nearly impossible to undelete under NT 4.0. Such is progress.

System upgrades, such as expanding a disk farm, should be approached with particular care. Information in the system registry should be recorded and safeguarded. This could be essential to reconstruct a RAID set from a collection of individual drives after an upgrade.

NT offers a facility to create an emergency backup repair disk, which can be used to rebuild the system registry and the partition table. A backup repair disk is highly valuable to data recovery engineers. A new emergency repair disk should be created using the NT Disk Administrator whenever the disk configuration changes.

Unix utilities

When Unix operators can't see or mount a file system, they sometimes run mkfs, the Unix utility used to create a file system. This is a major mistake, because mkfs rewrites the file system's inodes, where Unix stores the file structure, leading to lost data that is practically unrecoverable, even for professional recovery engineers.

In less catastrophic circumstances, having an Emergency Boot Disk, similar to NT's Emergency Backup Repair Disk and produced at installation time, is highly recommended.

Administrators should note, however, that each vendor's Unix implementation is unique, varying in the actual on-disk data structures while sharing the command line and programmatic interfaces, functions and utilities that define Unix. So, although different Unix vendors' file systems look the same, they differ under the hood. Administrators should seek skilled, vendor-specific advice before attempting a data recovery operation on Unix.

It's RAID!

RAID devices come with their own recovery issues. Perhaps most significant is that administrators think their data is safe on a RAID system, and therefore they don't perform backups. This is a major mistake. RAID systems, although less frequently victims of data loss than single, unstriped drives, are hit harder when such a disaster does happen.

Unskilled technicians often replace the wrong drive when a RAID reports a disk failure, swapping out the second drive when the third drive has failed, based on a misunderstanding of drive numbering that designates the first drive as drive 0, not 1. Should a cooling fan fail, the entire array can become overheated, wiping out multiple drives beyond the RAID system's ability to reclaim the data.

Such an event requires data recovery on the entire RAID system, not just a single drive. Even performing a simple RAID upgrade, such as replacing the controller card, can have dire consequences, because data and parity striping and drive order information are often stored on the controller's ROM chip. When the card is pulled, the configuration goes with it, and so, too, goes the access to the data.
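The drive-numbering mix-up described earlier is an off-by-one error, and it is easy to check before anyone pulls a drive. The Python sketch below converts the index a controller reports into a physical position counted from one. It assumes a controller that numbers drives from 0, as in the example above; the function names are hypothetical, and the actual numbering scheme should always be confirmed against the vendor's documentation.

```python
def physical_position(reported_index: int, first_index: int = 0) -> int:
    """Convert a controller's drive index to a 1-based physical position.

    first_index is the number the controller gives to the first drive
    (assumed to be 0 here; some systems may start at 1).
    """
    if reported_index < first_index:
        raise ValueError("reported index is lower than the first drive index")
    return reported_index - first_index + 1


def ordinal(position: int) -> str:
    """Format a positive integer as an English ordinal, e.g. 3 -> '3rd'."""
    if 10 <= position % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(position % 10, "th")
    return f"{position}{suffix}"


def describe_failed_drive(reported_index: int, first_index: int = 0) -> str:
    """Spell out which physical drive a reported failure refers to."""
    position = physical_position(reported_index, first_index)
    return f"Controller index {reported_index} is the {ordinal(position)} physical drive"
```

With zero-based numbering, a failure reported on drive 2 refers to the third physical drive, not the second.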

What to expect from a data recovery service

What actions should be taken when data recovery is necessary? First, skilled technicians should be assigned as soon as administrators suspect data loss or corruption that cannot be remedied by a standard restore from a current backup. This is the best assurance that the damage will be minimized and contained.

A professional recovery service will first assess the client's situation and advise the client on what action to take. Most of the time, drives will need to be shipped off to the recovery specialist. Tools providing low-level access to the drive will extract and restructure lost and corrupted data, which can then be returned to the client on CD, tape or a drive equivalent to the failed device.

Some recovery companies, including Ontrack, work with disk manufacturers and OEMs to replace drives still under warranty.

In the future, small special-purpose operating systems, such as Ontrack's Data Advisor 3, will reside alongside the NOS, providing general disk monitoring and remote data recovery diagnosis and repair for non-hardware errors.

Erick Von Schweber is a principal of Infomaniacs, a think tank in Sedona, Ariz., specializing in technology convergence.


A crash course

Data recovery recommendations for users and network administrators

  • Know the value of your data--staff and spend appropriately
  • Perform regular backups
  • Check your backups--recovery has often been required even where backups were routine
  • Allocate resources to periodically restore backups for additional assurance
  • Perform preventive maintenance: Keep machines in a clean, dust-free environment; use disk-scanning utilities; and replace aging drives before total failure occurs
  • Create an emergency repair or boot disk if supported by the NOS
  • Keep disks defragmented--an orderly disk can enhance professional data recovery
  • Don't panic. In the event of data loss or corruption, get several opinions on how to proceed before taking action
  • Keep a log of all actions performed on a downed system
  • Initiate an archive migration plan to ensure long-term data access
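The recommendation to keep a log of all actions performed on a downed system can be as simple as a text file kept on a separate, healthy machine. The Python sketch below appends one timestamped line per action. The file layout (tab-separated time, operator and action) and the function name are assumptions made for illustration, not an established format.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_path, operator: str, action: str, when=None) -> str:
    """Append one timestamped entry to the action log and return it.

    The log is opened in append mode so earlier entries are never rewritten.
    """
    moment = when or datetime.now(timezone.utc)
    timestamp = moment.isoformat(timespec="seconds")
    # Keep each entry on a single line so the log stays easy to read.
    cleaned = " ".join(action.split())
    entry = f"{timestamp}\t{operator}\t{cleaned}\n"
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

A recovery engineer handed such a log can see exactly which utilities were run, in what order, before the drives arrived.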

