SCSI Troubleshooting Tips
When responsible for the operation of a SCSI configuration you sometimes have
trouble finding the root of a (usually) intermittent fault. Like any other
bus the SCSI bus has some devices attached to it that need to work together.
Any failure in one of the devices or in the cabling/harness can affect the
operation of all devices on the bus. So, how do you locate faults quickly?
The answer is simple: Use some simple testing tools and,
mostly, your brains. For the latter to be usable you need a thorough
understanding of the principles of the SCSI bus. A short tutorial on this
site gives you some of the most basic facts. But if all that is new to you,
you had better read a book about the SCSI bus or attend a course. (We
offer courses in SCSI basics and SCSI troubleshooting, see our training
partner Onsite Computer.)
This article describes some easy but powerful ways to find most of
the problems on a parallel SCSI bus. We do not consider the serial versions
here. The procedures described herein do not need any
specialized equipment - a simple multimeter is all you need. But, of course,
this short essay cannot be a substitute for thorough training.
The most common faults on the parallel SCSI bus
Believe it or not, the most common faults are in the cables, connectors
and terminators. Not that they fail more often than the rest, but they are
typically disregarded and/or mistreated. Let us start with the terminators.
The SCSI bus has a strict definition of where terminators are to be placed:
- There are always two and only two on the SCSI bus. There must never be
more than two, nor only one. (Without any terminator the bus will totally
cease to function, but this fault is easy to detect.) What will the symptoms
be when there are not exactly two terminators? The symptoms in most cases
will be strange error messages in the log file of the type "SCSI phase
sequence error" or "SCSI unexpected bus free" or even a lot of entries
that indicate that just one specific device produces intermittent errors.
Don't blame the device! It was designed to work on a properly configured
bus, not on one with more noise or a lower impedance than specified.
You can easily check how many terminators there are on the bus, at least
for the single-ended version of the bus (the one most commonly used to attach
cheap peripherals):
Take your multimeter, switch it to milliamps, and measure the short
circuit current between any data or control line (but not TermPWR!)
and ground. On the 50-pin flat cable with the pin-header connectors simply
take a pair of opposite contacts at either end of the connector. These are
either pins 1 and 2 or 49 and 50. Both pairs have a ground and a signal line.
If you read about 24mA then there are exactly two terminators on the cable
and you have to investigate further. If you read less than 18mA there
is only one terminator present, and if the value is larger than 30mA three or
more terminators are there. Go and look for them now. Then proceed to the
next step.
- The two terminators must be located at the physical ends of the
cable. Not somewhere near them, but really at the ends. What did your test in
the first step show? Is the number correct? If yes, are both where
they are expected to be (at the ends)?
Don't underestimate the importance of the correct location of the
terminators! As in the first case the symptoms can be very misleading.
When the bus is operating at higher speeds the cable becomes a transmission
line with all its intricacies. And transmission lines must be properly
terminated, period. Typical error messages look like the ones mentioned
above, in the end blaming an innocent device that simply has the bad luck
to be placed at an inconvenient place on the cable.
So, check that the two terminators are really at the ends of the cable.
Do not allow any stub of cable to protrude beyond a terminator. OK, checked
it, but the intermittent faults are still there? Try the next step.
- Do you use active or passive terminators? The standard urges you to
use active terminators at the higher transfer speeds (>10MT/s) and not
to mix active and passive termination. Although mixing usually presents no
big problem, it sometimes does. So check it and use active termination
whenever possible.
- How long are all cables combined and how fast do you run the bus?
(Assuming that you have neither a differential nor a SCSI-3 LVD bus.)
There are four ranges: <1.5m, 1.5m to 3.0m, 3.0m to 6.0m and >6m.
If your figure falls into the fourth range you must find a way to reduce
the cable length - there is no easy way around it, unless you spend money
on converters. (Contact your dealer for more information about the
availability and pricing of these converters.)
If your cables are less than 6m, but more than 1.5m in length you have
two options: either reduce the total length to less than 1.5m or limit
the speed of the bus, for the longest cables down to a maximum of 5 million
transfers per second (5MT/s). On an 8 bit bus this means 5MB/s, on a 16 bit
(WIDE) one 10MB/s. Probably not exactly what you like, but unless you can
shorten your cables reliable operation is not guaranteed at higher speeds.
In tabular form the dependency between the allowable cable length and the
maximum transfer speed looks like this:
Cable Length | <1.5m  | 1.5m ... 3.0m | 3.0m ... 6.0m
Max. Speed   | 20MT/s | 10MT/s        | 5MT/s
As with the misconfigurations in the previous steps the generated error
messages are more confusing than helpful when the cables are too long,
causing false signal transitions at unexpected times.
Be aware of one common pitfall: Your old SCSI controller is replaced by a
modern one and all of a sudden some devices (possibly including the brand new
controller) produce errors. This "upgrade" trap is very common. Remember
that on the SCSI bus all the controllers check all the devices after a
bus reset and mutually agree on a transfer rate to use. Suppose you
had an AHA12xx in your system and replaced it with an AHA29xx. Additionally
you bought a "Hawk" disk a couple of weeks ago. With the old controller
everything worked fine, but with the new one the disk and/or the controller
produce error messages. This is because now both the controller and the
drive can speak faster, and they will do so unless you limit the maximum
transfer rate that the controller negotiates.
- Are the cables and the terminators still in good shape? No kinks or
the like? Especially when a connection is plugged and unplugged
very often the contacts degrade and eventually become a cause of failures.
Remember that only some special connectors are designed for more than a few
hundred mating cycles, and these are usually not used here. So when you
disconnect a device once per day you will have worn out the connectors
within a year!
And the cables can break, too. Route them so that they do not
experience any strain, and do not bend them with a radius of less than 5
times the diameter of the cable. For a cable with 10mm diameter this means a
bend radius of at least 50mm.
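The numeric rules of thumb from the checks above (terminator count from the
measured current, maximum bus speed from the total cable length) can be
sketched in a few lines of Python. This is only an illustration: the function
names are invented here, the thresholds are the figures quoted in this
article, and the handling of a zero reading and of lengths that fall exactly
on a range boundary are assumptions.

```python
def terminator_count_hint(current_ma):
    """Interpret the short-circuit current (in mA) measured between a
    signal line and ground on a single-ended bus."""
    if current_ma <= 0:
        # Assumption (not stated in the text): no current at all means
        # no terminator is sourcing current onto the line.
        return "none"
    if current_ma < 18:
        return "one"
    if current_ma > 30:
        return "three or more"
    return "two"  # readings around 24 mA


def max_speed_mts(total_length_m):
    """Return the maximum transfer rate in MT/s for the combined cable
    length in metres, or None if the cables are too long (more than 6 m)."""
    if total_length_m < 1.5:
        return 20
    if total_length_m < 3.0:  # exactly 3.0 m is treated conservatively (assumption)
        return 10
    if total_length_m <= 6.0:
        return 5
    return None


def throughput_mb_s(speed_mts, bus_width_bits=8):
    """Data rate in MB/s: each transfer moves bus_width_bits / 8 bytes."""
    return speed_mts * bus_width_bits // 8


if __name__ == "__main__":
    print(terminator_count_hint(24))
    print(max_speed_mts(2.4))
    print(throughput_mb_s(5, 16))
```

A reading of 24mA is reported as "two", a combined cable length of 2.4m
allows 10MT/s, and 5MT/s on a 16 bit (WIDE) bus gives 10MB/s.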
Your cables and terminators are all OK, but your system still shows
problems? Now you are about to enter a swamp if you do not have access
to some good equipment. Sure, some errors can be found by carefully
evaluating the error log entries (if they contain more information than
those of Windows NT). The rest are very hard to track down. But some
tips are available.
More tricky faults
- How many and which devices supply TermPWR? Usually this is not a critical
factor, but sometimes there are glitches on the power supply of a device
that do not negatively affect the operation of that device itself, but do
affect others on the SCSI bus if that device is supplying TermPWR. Normally
you will select no more than about 3 to 4 devices to supply TermPWR. These
will be the devices without which the whole system would not work anyway. In
most cases these are the controllers, but in some rare cases a peripheral
will supply it in addition. In very noisy environments it is advisable to
enable TermPWR at the devices closest to the terminators. But
never set up a device so that it powers its onboard terminator from
its internal power supply. Always let the device supply power to the
cable and feed the terminator off the cable.
- Do you operate very old devices on your bus? This means devices that
were designed to the first SCSI spec or to the early (prestandard)
versions of SCSI-2. As the requirements changed, although not by much, some
devices cannot reliably interoperate with newer ones. Usually the resulting
errors are predictable and associated with the particular controller/device
combination. Check the revision of the standard the device and the
controller are designed to. Sometimes this is not a simple task and you
must guess what the different specs mean and whether they match what the
standard requires.
- Does your controller understand your devices? This is not always the
case. Some years ago the customers of a computer manufacturer experienced
a strange fault on their RAID sets: Every morning the RAID controller showed
that all of its disks were dead. After cycling power or resetting the
controller the fault was gone and everything worked fine - until they came
to the office the next morning to find all disks "dead" again. The fault
was simple in principle: The controller and the disks were designed a little
bit differently. The disks, after spinning down at the end of the inactivity
period, responded to a request from the controller as being "not operable".
From their point of view this was correct, because they first had to spin up
to be really usable. But from the controller's side this was an error
message that told the controller that the disks had detected some internal
failure that prevented them from operating. So the controller marked them
bad, one after the other as they spun down, and refused to give them a
second chance. The quick fix was to prevent the disks from spinning down
at all and the final fix was to upgrade the firmware on the disks and on the
controller.
- Is the grounding of all devices on the bus correct? You can do a quick
check with your multimeter. Switch it to volts AC, disconnect the cable
from a device and measure whether there is any significant potential between
the ground on the device and the ground on the cable. If you read more than
a few volts then switch to amps AC and measure the current between the two
grounds. If this current exceeds a few milliamperes you have a problem. Find
and fix the bad ground connection. Usually this happens when a cable runs
from one cabinet or box to another and both are supplied with power from
different outlets. It is not uncommon for a power distribution to show
significant potential between different outlets. Although completely harmless
to living beings, a couple of volts can introduce a ground shift that
prevents a low voltage signal from being recognized correctly. Ideally there
should be only one common grounding point for the whole system, but local
safety regulations often make this impractical or impossible. So try to
minimize any current flow over the ground lines of your bus cable.
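The two-step ground check can be written down as a small decision helper.
This is a sketch only: the function name is hypothetical, and because the
text gives no exact limits ("a few volts", "a few milliamperes"), both limits
must be passed in by the caller. The values used in the usage lines are
illustrative assumptions, not figures from any standard.

```python
def ground_check(voltage_ac_v, voltage_limit_v,
                 current_ac_ma=None, current_limit_ma=None):
    """Walk through the two-step ground check.

    Returns "ok", "measure current" or "bad ground".
    """
    # Step 1: AC potential between the device ground and the cable ground.
    if voltage_ac_v <= voltage_limit_v:
        return "ok"
    # Step 2: the voltage is significant, so the AC current is needed next.
    if current_ac_ma is None:
        return "measure current"
    if current_limit_ma is None:
        raise ValueError("current_limit_ma is required with current_ac_ma")
    if current_ac_ma > current_limit_ma:
        return "bad ground"
    return "ok"


if __name__ == "__main__":
    # Assumed limits for illustration only: 2.0 V AC and 5.0 mA AC.
    print(ground_check(0.3, 2.0))
    print(ground_check(4.0, 2.0))
    print(ground_check(4.0, 2.0, 12.0, 5.0))
```

With these assumed limits, 0.3V passes straight away, 4.0V asks for the
current measurement, and 4.0V together with 12mA is reported as a bad ground.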
If your problems remain you should consult a specialist who has access to
special equipment. Trying to find an intermittent fault without the right
tools is too time consuming and frustrating.
© Paul Elektronik, 1998-2002