Some RAM Testing Strategies
This article discusses some testing strategies for RAMs. We focus here on
the tests for packaged ICs, not on the wafer level (this is done by the
manufacturers and they have their specialists for that). So we talk about
faults that can creep in during shipment, handling, mounting and soldering.
This means that we do not try to detect all possible faults, but
limit our tests to the most probable ones.
Why can't we test for all faults? Because this would take too much
time. For sizes up to a few hundred bytes this is feasible and
sometimes even advisable, but even for a 32kB RAM we would need in excess of
10^500 tests.
Even if we could do one every nanosecond, this complete test would take
longer than the universe has existed. (The manufacturers have their own strategies
to test the chips for defects effectively and quickly and to repair them. But
these tests operate on the bare die and we cannot use them at the board level.)
We need to build a hierarchical testing strategy. The first step is to
consider the physical arrangement of the memory subsystem. Do we need to
test it to the chip level or is it sufficient to test only to the module
level? In a PC we can probably test to the module level and do only some
tests to verify that the module is still ok. But it is a waste of time
(both development time and testing time) to retest the whole module at
every boot.
On the other hand, when we build our own memory subsystem from "naked" ICs,
we need to test to the chip level. This includes the address decoder and all
buffer chips, although these tests are, to a lesser extent, also necessary
for a module. The next step is to have a look at the address and data bus.
Do other chips share the same physical lines, that is, are they
electrically connected to the same lines? If yes, we must be very
cautious not to misinterpret the results of our tests. We might find the
RAM to be faulty when the real fault is some other chip whose bus interface
or address decoder is defective.
Do we need a detailed report on the type and (possible) location of the
fault, or is it sufficient to report on a go/no-go basis without much
detail? The former is much more difficult because of the presence of other
chips in the system, while the latter is usually sufficient for a test at
boot time. The majority of all systems will use the latter type, so we
concentrate on it.
Let us start with the most basic tests. Unless otherwise noted, it
is very advisable to do them in the order presented here. The term
word refers to the width of the data path to the RAM, e.g. for a
byte-wide chip a word is 8 bits. A word must be written or read in one atomic
operation! So it is not allowed to write a 16 bit word in two byte
operations!
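As a minimal sketch of what such a word access can look like in C, assume a memory-mapped RAM with a 16 bit data path. The names ram_word_t, ram_write and ram_read are hypothetical. Whether the compiler really emits one bus cycle per access depends on the target and on alignment, so this should be checked in the generated code.

```c
#include <stdint.h>

/* Assumption: 16 bit data path, so one word is a uint16_t. */
typedef uint16_t ram_word_t;

/* Write one full word. The volatile qualifier keeps the compiler from
   optimising the access away or merging it with other accesses. */
static void ram_write(volatile ram_word_t *addr, ram_word_t value)
{
    *addr = value;
}

/* Read one full word back in a single access of the same width. */
static ram_word_t ram_read(volatile ram_word_t *addr)
{
    return *addr;
}
```

The point of the typedef is that every test accesses the RAM with exactly the width of the data path and never with a narrower type.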
Data Lines: "Walking One"
For this test we choose an arbitrary address and do all tests at
this address. The highest or the lowest address is a preferred
candidate, because then all address lines are at the same level. Although
this is not mandatory, it is advisable. We set one data line
at a time to "1"
and all others to "0". We start with D0 and proceed to the highest data
line. In each cycle we write a full word and read it back. When we get
back the identical data we go to the next data line and repeat the write
and read cycle until all data lines have been tested. Then we proceed with
the next test.
Data Lines: "Walking Zero"
This test is analogous to the one above, but this time with all data lines
set to "1", except the line under test. Do not skip this test! You cannot
rely on the previous test having already detected every case of two
cross-connected lines.
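The two data line tests can be sketched as one routine, assuming an 8 bit data path and a memory-mapped RAM. The names DATA_BITS, ram_word_t and data_line_walk are hypothetical, and cell stands for the single address chosen for the test.

```c
#include <stdint.h>

#define DATA_BITS 8u            /* assumed width of the data path */
typedef uint8_t ram_word_t;     /* one word = one atomic access   */

/* Walk a single bit across the data bus at one fixed address.
   invert == 0: walking one  (one line "1", all others "0")
   invert != 0: walking zero (one line "0", all others "1")
   Returns -1 if all lines pass, else the number of the first data
   line whose read-back differs from what was written. */
static int data_line_walk(volatile ram_word_t *cell, int invert)
{
    for (unsigned bit = 0; bit < DATA_BITS; bit++) {
        ram_word_t pattern = (ram_word_t)(1u << bit);
        if (invert)
            pattern = (ram_word_t)~pattern;
        *cell = pattern;        /* write the full word */
        if (*cell != pattern)   /* read it back        */
            return (int)bit;
    }
    return -1;
}
```

A boot test would call data_line_walk(addr, 0) first and data_line_walk(addr, 1) second, and stop at the first result other than -1.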
Data Lines: "Exhaustive Test" (Optional) (not cheap)
For 8 or 16 bit data paths it is reasonable to do an exhaustive test, but
not for wider ones, because the test time is out of proportion to the
additional assurance. The above two tests uncover more than 99% of all
faults that this exhaustive test will find.
Instead of using only 2N tests we do 2**N, by writing every
possible combination. We start with zero and count up to the maximum or vice
versa. Obviously this takes much more time, but it finds cross-connects of
more than two lines. This is a very rare fault, but in some situations it is
justified to spend this time, as in a high availability system.
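A sketch of the exhaustive variant, again assuming an 8 bit data path; DATA_BITS, ram_word_t and data_exhaustive are hypothetical names. For a 16 bit path the typedef and DATA_BITS would change, the loop stays the same.

```c
#include <stdint.h>

#define DATA_BITS 8u            /* assumed width of the data path */
typedef uint8_t ram_word_t;

/* Write every one of the 2**N possible words to one address and read
   each back. Returns -1 on pass, else the first value that failed. */
static long data_exhaustive(volatile ram_word_t *cell)
{
    const unsigned long count = 1ul << DATA_BITS;   /* 2**N patterns */
    for (unsigned long value = 0; value < count; value++) {
        *cell = (ram_word_t)value;
        if (*cell != (ram_word_t)value)
            return (long)value;
    }
    return -1;
}
```

With N = 8 this is 256 write and read cycles instead of the 16 of the two walking tests; with N = 16 it is 65536.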
Summary
What do we know at this stage? We can be sure that the data path to the
RAM is ok. About everything else we cannot make any assumption, not even
that we really talked to the RAM at all! We might simply have read back
our data from the buffer chips instead of the RAM.
Address Lines: "Walking One"
The name is perhaps misleading, but it describes the principle. Here we
do the following: (the data we write is arbitrary; choose one value and
stick to it, e.g. AAAA. But do not choose the address as data. This
would spoil the tests completely, as explained below.)
- Set one address line to "1" and write the data.
- Read the data back from this and all other addresses with one address
bit set to "1" and the others set to "0".
So we write our data to address 0001. Then we read addresses 0010, 0100
and 1000. Our data must not appear at any of these addresses. (If the chip is
not set to known contents at power up, we first write some "empty" pattern to
all locations. Zeros are a good candidate for this.) Finally we read address
0001 and check for the presence of our data. Do not do this read
immediately after the write. Access at least one other address between
the write and the read.
Repeat that test with another start address with one bit set until all bits
have been used. Note: Although theoretically you could omit from the read
back all addresses that have already been used as start address, it is not
unreasonable to include them for the sake of completeness, and this does not
cost much more time.
Address Lines: "Walking Zero"
Now we do the previous test with the "1" and the "0" exchanged. All
addresses have only one bit set to "0".
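Both address line tests can be sketched with one routine, assuming a byte-wide, memory-mapped RAM with nbits address lines (nbits smaller than the width of size_t). All names are hypothetical. Resetting the start address to the "empty" pattern at the end of each round is an assumption of this sketch, made so that earlier start addresses can be included in later read backs.

```c
#include <stddef.h>
#include <stdint.h>

#define EMPTY   0x00u   /* "empty" background pattern              */
#define MARKER  0xAAu   /* arbitrary test data, not the address    */

/* Address used in round `bit`: exactly one bit set (walking one) or,
   if invert is non-zero, exactly one bit cleared (walking zero). */
static size_t walk_addr(unsigned bit, unsigned nbits, int invert)
{
    size_t mask = ((size_t)1 << nbits) - 1;
    size_t addr = (size_t)1 << bit;
    return invert ? (~addr & mask) : addr;
}

/* Returns -1 on pass, else the address line of the failing round. */
static int addr_line_walk(volatile uint8_t *ram, unsigned nbits, int invert)
{
    /* Bring all addresses used by the test to known contents. */
    for (unsigned i = 0; i < nbits; i++)
        ram[walk_addr(i, nbits, invert)] = EMPTY;

    for (unsigned start = 0; start < nbits; start++) {
        size_t home = walk_addr(start, nbits, invert);
        ram[home] = MARKER;

        /* The marker must not show up at any other walking address. */
        for (unsigned other = 0; other < nbits; other++) {
            if (other != start &&
                ram[walk_addr(other, nbits, invert)] == MARKER)
                return (int)start;
        }
        /* Only now, after other accesses, read the start address. */
        if (ram[home] != MARKER)
            return (int)start;

        ram[home] = EMPTY;      /* assumption: reset for the next round */
    }
    return -1;
}
```

Calling addr_line_walk(ram, nbits, 0) gives the walking one, addr_line_walk(ram, nbits, 1) the walking zero.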
The following tests are alternatives and can be done at any time after the
data line tests.
Extended Walking Tests
Another possibility is to do extended walking tests. Here we use groups of
contiguous ones or zeros that we shift by one position in each round. So we
start with 0011 as the first start address, then 0110, 1100 and then 1001.
The next test is done with three contiguous ones, e.g. 0111, 1110, 1101, and
so on. (Note: if you decide to use this strategy you can include the above
two walking tests as special cases of this test, because you can start this
test with 0001, i.e. the walking one, and end it with 1110, i.e. the walking
zero.) This test needs N*N steps and is rather cheap: with 32 address lines
you need only 1024 rounds, each with 64 memory accesses, for a total of 64k
access operations. After completion of this test you can be pretty confident
that the address bus is ok. Together with the data line tests you can rest
assured that you caught more than 99% of all faults and more than 99.99% of
the probable ones.
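The start addresses of the extended walking test can be generated by rotating a group of ones within the address width. The sketch below only produces the list of addresses; the write and read back per round works as in the walking tests. The names rotl_n and extended_walk_addrs are hypothetical, and at most 32 address lines are assumed. This enumeration yields N*(N-1) start addresses, which is in line with the N*N estimate.

```c
#include <stdint.h>

/* Rotate the low `nbits` bits of `value` left by `shift` positions. */
static uint32_t rotl_n(uint32_t value, unsigned shift, unsigned nbits)
{
    uint32_t mask = (nbits >= 32u) ? 0xFFFFFFFFu
                                   : (((uint32_t)1 << nbits) - 1u);
    shift %= nbits;
    if (shift == 0)
        return value & mask;
    return ((value << shift) | (value >> (nbits - shift))) & mask;
}

/* Fill `out` with all start addresses: groups of 1 .. nbits-1
   contiguous ones, each rotated through all nbits positions.
   Returns the count, nbits * (nbits - 1); `out` must have room
   for that many entries. */
static unsigned extended_walk_addrs(uint32_t *out, unsigned nbits)
{
    unsigned count = 0;
    for (unsigned ones = 1; ones < nbits; ones++) {
        uint32_t group = ((uint32_t)1 << ones) - 1u;  /* 0011 for ones == 2 */
        for (unsigned pos = 0; pos < nbits; pos++)
            out[count++] = rotl_n(group, pos, nbits);
    }
    return count;
}
```

For 4 address lines the list begins with 0001, 0010, 0100, 1000 (the walking one), continues with 0011, 0110, 1100, 1001, and its last four entries are the walking zero addresses.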
Statistical Test
Or you do a statistical test by picking in each round one random address as
the start address and some other random addresses as the targets from which
you read. Of course, you can also write to the read targets, as you
could in the walking tests. Do not forget to check at the end of each
round that the original data is still in the start address. The advantage
of this test is that you can reach an increasingly higher coverage by
letting the test run longer. As a rule of thumb, read as many locations in
each round as there are address lines. It does not pay off to read many
more.
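A sketch of the statistical test, assuming a byte-wide, memory-mapped RAM of 2**nbits bytes with nbits below 32. The xorshift generator is only one possible choice of random source, and all names are hypothetical. This variant uses the optional write to each target, so it destroys the RAM contents.

```c
#include <stddef.h>
#include <stdint.h>

#define EMPTY   0x00u
#define MARKER  0xAAu

/* Small xorshift generator; any PRNG would do. State must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Runs `rounds` rounds. Returns -1 on pass, else the failing round. */
static long addr_random_test(volatile uint8_t *ram, unsigned nbits,
                             unsigned long rounds, uint32_t seed)
{
    uint32_t rng = seed ? seed : 1u;
    size_t mask = ((size_t)1 << nbits) - 1;

    for (unsigned long r = 0; r < rounds; r++) {
        size_t home = xorshift32(&rng) & mask;
        ram[home] = MARKER;

        /* Rule of thumb: as many targets as there are address lines. */
        for (unsigned t = 0; t < nbits; t++) {
            size_t target = xorshift32(&rng) & mask;
            if (target == home)
                continue;
            ram[target] = EMPTY;        /* optional write to the target */
            if (ram[target] != EMPTY)
                return (long)r;
        }
        if (ram[home] != MARKER)        /* original data still there? */
            return (long)r;
        ram[home] = EMPTY;
    }
    return -1;
}
```

Raising the rounds argument is the knob that trades test time for coverage.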
Note: Some of these tests you can even execute on a running system, although
only in a somewhat limited fashion. You reserve the addresses (blocks) from
the memory controller (subsystem) for each round. This forces you to
integrate the test tightly into the memory management code. But for a
high-availability system it is a powerful technique to detect failures at
runtime, sometimes even before they manifest themselves otherwise.
© Paul Elektronik, 2002