Some RAM Testing Strategies

This article discusses some testing strategies for RAMs. We focus on tests for packaged ICs, not on wafer-level tests (those are done by the manufacturers, who have their own specialists for that). So we talk about faults that can creep in during shipment, handling, mounting and soldering. This means that we do not try to detect all possible faults, but limit our tests to the most probable ones.

Why can't we test for all faults? Because this would take too much time. For sizes up to a few hundred bytes this is feasible and sometimes even advisable, but even for a 32 kB RAM we would need in excess of 10**500 tests. Even if we could do one every nanosecond, this complete test would take longer than the universe has existed. (The manufacturers have their own strategies to test the chips for defects effectively and quickly and to repair them. But these tests operate on the bare die and we cannot use them at the board level.)

We need to build a hierarchical testing strategy. The first step is to consider the physical arrangement of the memory subsystem. Do we need to test down to the chip level, or is it sufficient to test only to the module level? In a PC we can probably test to the module level and do only some tests to verify that the module is still OK. It is a waste of time (both development time and testing time) to retest the whole module at every boot.

On the other hand, when we build our own memory subsystem from "naked" ICs, we need to test down to the chip level. This includes the address decoder and all buffer chips, although these tests are, to a lesser extent, also necessary for a module. The next step is to have a look at the address and data bus.

Do other chips share the same physical lines, that is, are they electrically connected to the same lines? If yes, we must be very cautious not to misinterpret the results of our tests. We might find the RAM to be faulty when the real fault lies in some other chip whose bus interface or address decoder is defective.

Do we need a detailed report on the type and (possible) location of the fault, or is it sufficient to report on a go/no-go basis without much detail? The former is much more difficult because of the presence of other chips in the system, while the latter is usually sufficient for a test at boot time. The majority of all systems will use the go/no-go type, so we concentrate on it.

Let us start with the most basic tests. Unless otherwise noted, it is advisable to do them in the order presented here. The term "word" refers to the width of the data path to the RAM, e.g. if we have a byte-wide chip we use 8 bits. A word must be written or read in one atomic operation! So it is not allowed to write a 16-bit word in two byte operations!

  1. Data Lines: "Walking One"

    For this test we choose an arbitrary address and use it for all accesses. The highest or the lowest address is the preferred candidate, because then all address lines are at the same level. This is not mandatory, but it is advisable. We set one data line at a time to "1" and all others to "0". We start with D0 and proceed to the highest data line. In each cycle we write a full word and read it back. If we get back the identical data, we go to the next data line and repeat the write and read cycle until all data lines have been tested. Then we proceed with the next test.

  2. Data Lines: "Walking Zero"

    This test is analogous to the one above, but this time all data lines are set to "1", except the line under test, which is set to "0". Do not skip this test! You cannot rely on the previous test having already detected every case of two cross-connected lines.
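
    The two data line tests above can be sketched in C roughly as follows. This is only an illustration under assumptions: the names ram_word_t, RAM_WORD_BITS and test_data_lines are made up for this sketch, a 16-bit data path is assumed, and the address passed in is assumed to map directly to the RAM (no cache in between) so that each volatile access becomes one word-wide bus cycle.

```c
#include <stdint.h>

/* Hypothetical word type: must match the physical width of the data path. */
typedef uint16_t ram_word_t;
#define RAM_WORD_BITS 16u

/*
 * Walk a single "1" (walk_zero == 0) or a single "0" (walk_zero != 0)
 * across all data lines, always using the same address.
 * Returns 0 if every pattern reads back unchanged. Otherwise returns a
 * mask with a "1" at each bit that differed in the first failing pattern.
 */
ram_word_t test_data_lines(volatile ram_word_t *addr, int walk_zero)
{
    for (unsigned bit = 0; bit < RAM_WORD_BITS; bit++) {
        ram_word_t pattern = (ram_word_t)(1u << bit);
        if (walk_zero)
            pattern = (ram_word_t)~pattern;

        *addr = pattern;          /* one full-word write */
        ram_word_t got = *addr;   /* one full-word read  */

        if (got != pattern)
            return (ram_word_t)(got ^ pattern);
    }
    return 0;
}
```

    A boot routine would call test_data_lines(addr, 0) for the walking one and then test_data_lines(addr, 1) for the walking zero, treating any non-zero result as "no-go".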

  3. Data Lines: "Exhaustive Test" (Optional) (not cheap)

    For 8 or 16 bit data paths it is reasonable to do an exhaustive test, but not for wider ones, because the test time is out of proportion to the additional assurance. The above two tests uncover more than 99% of all faults that this exhaustive test will find.

    Instead of using only 2N tests we do 2**N, by writing every possible combination (N is the number of data lines). We start with zero and count up to the maximum, or vice versa. Obviously this takes much more time, but it finds cross-connects of more than two lines. This is a very rare fault, but in some situations it is justified to spend this time, as in a high-availability system.
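
    A minimal sketch of the exhaustive variant, again assuming a 16-bit data path and a hypothetical ram_word_t type and function name. It counts up from zero and relies on the unsigned word wrapping back to zero after the maximum value.

```c
#include <stdint.h>

/* Hypothetical word type: must match the physical width of the data path. */
typedef uint16_t ram_word_t;

/*
 * Write and read back every possible word value (2**N patterns) at one
 * fixed address. Returns 0 on success, -1 on the first mismatch.
 */
int test_data_exhaustive(volatile ram_word_t *addr)
{
    ram_word_t pattern = 0;
    do {
        *addr = pattern;
        if (*addr != pattern)
            return -1;
        pattern++;               /* wraps to 0 after the maximum value */
    } while (pattern != 0);
    return 0;
}
```

    With a 16-bit word this loop performs 65536 write/read cycles, compared with 32 for the two walking tests together.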

  4. Address Lines: "Walking One"

    The name is perhaps misleading, but it describes the principle. The data we write is arbitrary; choose one value and stick to it, e.g. AAAA. But do not use the address as data; this would spoil the test completely, as explained below. If the chip is not set to known contents at power-up, we first write some "empty" pattern to all locations; zeros are a good candidate for this. Then we write our data to address 0001 and read addresses 0010, 0100 and 1000. Our data should not appear at any of these addresses. Finally we read address 0001 and check for the presence of our data. Do not do this read immediately after the write; access at least one other address between the write and the read.

    Repeat that test with another start address with one bit set, until all bits have been used. Note: Although theoretically you could omit from the read-back all addresses that have already been used as a start address, it is not unreasonable to include them for the sake of completeness. This does not cost much more time.

  5. Address Lines: "Walking Zero"

    Now we do the previous test with the "1" and the "0" exchanged. Every address we use has exactly one bit set to "0".
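
    Both address line tests can be sketched with one C function. All names here (ram_word_t, TEST_DATA, EMPTY_DATA, walk_addr, test_addr_lines) are hypothetical. The sketch assumes a 16-bit data path, word addressing (index 1 is the second word), at least two address lines, and that the whole RAM may be overwritten. Resetting the start address to the empty pattern after each round is an addition of this sketch, so that earlier start addresses can be included in later read-backs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word type: must match the physical width of the data path. */
typedef uint16_t ram_word_t;

#define TEST_DATA  ((ram_word_t)0xAAAAu)  /* arbitrary, but not an address */
#define EMPTY_DATA ((ram_word_t)0x0000u)  /* "empty" pattern               */

/* Word address with exactly one bit set to "1" (walk_zero == 0) or
 * exactly one bit set to "0" (walk_zero != 0). */
static size_t walk_addr(unsigned bit, unsigned addr_lines, int walk_zero)
{
    size_t mask = ((size_t)1 << addr_lines) - 1u;
    size_t one  = (size_t)1 << bit;
    return walk_zero ? (mask & ~one) : one;
}

/*
 * base:       first word of the RAM under test
 * addr_lines: number of word address lines (RAM holds 2**addr_lines words)
 * Returns 0 on success, -1 on the first failure.
 */
int test_addr_lines(volatile ram_word_t *base, unsigned addr_lines,
                    int walk_zero)
{
    size_t words = (size_t)1 << addr_lines;

    /* Bring every location to known, "empty" contents first. */
    for (size_t a = 0; a < words; a++)
        base[a] = EMPTY_DATA;

    for (unsigned start = 0; start < addr_lines; start++) {
        size_t start_addr = walk_addr(start, addr_lines, walk_zero);
        base[start_addr] = TEST_DATA;

        /* The data must not show up at any of the other walking addresses. */
        for (unsigned other = 0; other < addr_lines; other++) {
            if (other == start)
                continue;
            if (base[walk_addr(other, addr_lines, walk_zero)] == TEST_DATA)
                return -1;
        }

        /* Read back the start address only after other accesses. */
        if (base[start_addr] != TEST_DATA)
            return -1;

        base[start_addr] = EMPTY_DATA;  /* leave it empty for the next round */
    }
    return 0;
}
```

    For a RAM with 15 word address lines, the walking one would be test_addr_lines(base, 15, 0) and the walking zero test_addr_lines(base, 15, 1).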

The following tests are alternatives and can be done at any time after the data line tests.

Note: Some of these tests can even be executed on a running system, although only in a somewhat limited fashion. For each round you reserve the addresses (blocks) to be tested from the memory controller (subsystem). This forces you to integrate the test tightly into the memory management code. But for a high-availability system it is a powerful technique to detect failures at runtime, sometimes even before they manifest themselves otherwise.

© Paul Elektronik, 2002