
( ESNUG 492 Item 4 ) -------------------------------------------- [05/26/11]

Subject: A hands-on verification engineer's first eval of Vennsa OnPoint

> Vennsa OnPoint is a new tool for automatic root cause failure analysis.
> It shows "suspects" pointing to the actual bugs in your source Verilog
> with no user direction needed.  Works with formal tools & RTL simulators.
> (booth 250)  Ask for Sean Safarpour.  Freebie: bottle opener
>
>     - from http://www.deepchip.com/gadfly/gad061010.html


From: [ The Man In The Iron Mask ]

Hi, John,

Please keep me anonymous.

Vennsa OnPoint has been mentioned a few times on DeepChip, but there has
been no technical report on the tool, so I thought I would share my
experience.

How It Works
------------

OnPoint starts with a simulation failure.  It does not simulate the Verilog
RTL itself, but grabs the simulation results from an FSDB file and a
mismatch from the testbench (simulated versus expected value).  It analyzes
the problem and tries to find out what parts of the RTL caused the failure.

The output is a list of what Vennsa calls "suspects" that can be viewed in
their GUI or text-based reports.  Each suspect is a location inside the RTL
that can be changed to fix the bug.

The following is a made-up example.

Consider the basic FIFO below.  There is a bug in there.  Can you see it?

     26    always @(posedge clk or posedge rst) begin
     27       if (rst) begin
     28          wp <= 0;
     29          rp <= 0;
     30          count <= 0;
     31       end else begin
     32          case ({push,pop})
     33             2'b01: begin
     34                data[wp] <= data_in;
     35                wp <= wp + 1;
     36                count <= count + 1;
     37             end
     38             2'b10: begin
     39                rp <= rp + 1;
     40                count <= count - 1;
     41             end
     42             2'b11: begin
     43                data[wp] <= data_in;
     44                wp <= wp + 1;
     45                rp <= rp + 1;
     46             end
     47          endcase
     48       end
     49    end

Due to the bug, the testbench triggers a FIFO overflow that is caught by an
assertion or a checker down the line.  To debug the problem with OnPoint,
you have to tell it that the FIFO should not overflow or that the checker
should not fire.  Then OnPoint will run and generate some suspects.
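
As a rough sketch of the kind of check meant here, the overflow condition
could be written as a small Verilog checker.  The module name, the DEPTH
and CW parameters, and the assumption that "count" is wide enough to hold a
value above DEPTH are all made up for illustration.  How the check is
actually handed to OnPoint is not shown.

```verilog
// Hypothetical checker: module name, parameters, and ports are assumptions.
module fifo_overflow_chk #(
   parameter DEPTH = 16,               // assumed FIFO capacity
   parameter CW    = 5                 // width of count; must hold 0..DEPTH and above
) (
   input wire          clk,
   input wire          rst,
   input wire [CW-1:0] count           // occupancy counter from the FIFO
);
   // Report the failure in cycle-level form: time, signal, actual, expected.
   // %m expands to the hierarchical name of this checker instance.
   always @(posedge clk) begin
      if (!rst && count > DEPTH)
         $display("ERROR: at time %0t, signal %m.count is %0d, but is expected to be <= %0d",
                  $time, count, DEPTH);
   end
endmodule
```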

The "suspects" OnPoint produces are the places in the Verilog RTL source
code where the bug can be fixed.  Basically, they're places where the bug
either originates or passes through.  In this case it returns the following
three suspects (the columns are suspect name, line number, and code
snippet).

     Suspect        Line #             Code on Line #
    ----------      ------             --------------
    Suspect #1        32            case ({push,pop})
    Suspect #2        35                wp <= wp + 1;
    Suspect #3        39                rp <= rp + 1;

Suspect #1 is where the bug was made: the "push" and "pop" were swapped.  In
other words, the only time the FIFO worked correctly was when data was
being pushed and popped at the SAME time.

Suspect #2 can fix the failure by not incrementing the write pointer, thus
overwriting previous addresses.

Similarly, Suspect #3 can fix the failure by moving the read pointer forward
and avoiding the overflow.

I know that this is a simple and silly example, but I hope it helps clarify
how OnPoint works.
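
For reference, here is a sketch of what the fix at Suspect #1 could look
like, with the case labels matched to the {push,pop} ordering.  The original
fragment shows no declarations, so the module wrapper, the DW and AW
parameters, the widths, and the "data_out" port are assumptions.  Like the
example above, it has no full or empty protection.

```verilog
// Sketch only: module name, parameters, widths, and data_out are assumptions.
module fifo_fixed #(
   parameter DW = 8,                   // data width (assumed)
   parameter AW = 4                    // address width (assumed); depth = 2**AW
) (
   input  wire          clk,
   input  wire          rst,
   input  wire          push,
   input  wire          pop,
   input  wire [DW-1:0] data_in,
   output wire [DW-1:0] data_out,
   output reg  [AW:0]   count
);
   reg [DW-1:0] data [0:(1<<AW)-1];
   reg [AW-1:0] wp, rp;

   assign data_out = data[rp];

   always @(posedge clk or posedge rst) begin
      if (rst) begin
         wp <= 0;
         rp <= 0;
         count <= 0;
      end else begin
         case ({push,pop})
            2'b10: begin               // push only (labelled 2'b01 in the buggy code)
               data[wp] <= data_in;
               wp <= wp + 1;
               count <= count + 1;
            end
            2'b01: begin               // pop only (labelled 2'b10 in the buggy code)
               rp <= rp + 1;
               count <= count - 1;
            end
            2'b11: begin               // push and pop together: count unchanged
               data[wp] <= data_in;
               wp <= wp + 1;
               rp <= rp + 1;
            end
            default: ;                 // 2'b00: idle, nothing changes
         endcase
      end
   end
endmodule
```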

Our Eval
--------

In our eval, we decided to use OnPoint on a "live" design, on 2 blocks that
had well-defined checkers, so the expected value was easily available from
a VERA model.

The whole process was automated such that OnPoint would run when a failure
occurred on our nightly regressions.  We evaluated OnPoint on the two blocks
and found 76 suspects for 13 failures or error messages.

We did a deep dive on 5 of the most interesting ones, where the designers of
the blocks in question went over every one of OnPoint's suspects.

The 2 blocks were:

AXI block:

   This block implements the on-chip communication interface based on
   the AMBA AXI specification.  The AXI block is 77 K lines of Verilog.

FE block:

   The forwarding engine (FE) is a block of a bigger communication chip.
   The FE looks into each received packet to determine its destination
   VLAN, port, QOS, and any packet modifications that may be performed.
   The FE is 582 K lines of Verilog.

Details on the 5 bugs and the 76 suspects found by OnPoint:

      AXI bug 1 - bug in testbench
      ----------------------------
         Bug source: handshaking bug in the testbench
         run-time: 5 mins
         memory used: 1.5 GB
         # of suspects: 3

      AXI bug 2 - bug in source
      -------------------------
         Bug source: state transition bug in design
         run-time: 10 mins
         memory used: 2.5 GB
         # of suspects: 15
         # of 3-star suspects: 5

      FE bug 1 - bug in source
      ------------------------
         Bug source: packet forwarded to wrong address based on header
         run-time: 20 mins
         memory used: 2.5 GB
         # of suspects: 17
         # of 3-star suspects: 6

      FE bug 2 - bug in testbench
      ---------------------------
         Bug source: in testbench (bad address read by testbench)
         run-time: 10 mins
         memory used: 1.5 GB
         # of suspects: 6

      FE bug 3 - bug in source
      ------------------------
         Bug source: bad field extraction based on table data
         run-time: 25 mins
         memory used: 3.0 GB
         # of suspects: 35
         # of 3-star suspects: 12

The Verilog simulator we use is Axiom MPSim, version RHEL3-5.0-A1.  For
OnPoint we dumped SpringSoft .fsdb format files using an old version of
the Debussy PLI that Axiom supports, version v52v24.  OnPoint needs a
dump file from the failing simulation to do its work.

Gotchas:

I should mention that although OnPoint did its job well, it does not give
you superhuman debugging powers.  It can't beat the original designer of
the block working together with the original author of the testbench to
debug a failure.  What it can do is get a non-expert looking at the right
part of the logic quickly.

I should also mention that OnPoint has an important limitation.  It can
only diagnose low-level error messages -- errors that can be stated as:

       "at time T, signal S is 123, but is expected to be 456"

It can't diagnose transaction level errors such as a scoreboard packet
mismatch, or any other message where the cycle-level information is not
available.  A possible workaround is to retain the cycle-level info in
the transaction object (packet) and report it when there is a mismatch.
We did not attempt this in our evaluation.
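
Since we did not try it, the following is only a sketch of how that
workaround might look in a plain Verilog testbench: a monitor stores the
simulation time next to every output word, so a later packet-level compare
can still print a cycle-level message.  Every name here (sb_cycle_info, SIG,
start_packet, check_word) and every width is hypothetical.

```verilog
// Hypothetical testbench scoreboard: all names and widths are assumptions.
module sb_cycle_info #(
   parameter DW   = 8,                 // width of one observed word (assumed)
   parameter MAXW = 64,                // max words kept per packet (assumed)
   parameter SIG  = "dut.out_data"     // signal name to print in the message
) (
   input wire          clk,
   input wire          out_valid,      // DUT output word valid
   input wire [DW-1:0] out_data        // DUT output word
);
   reg [DW-1:0] act_data [0:MAXW-1];   // observed words of the current packet
   reg [63:0]   act_time [0:MAXW-1];   // simulation time each word was seen
   integer      n_words;

   initial n_words = 0;

   // Monitor: keep the timestamp with every word as the packet is collected.
   // Blocking assignments are used on purpose; this is testbench-only code.
   always @(posedge clk) begin
      if (out_valid && n_words < MAXW) begin
         act_data[n_words] = out_data;
         act_time[n_words] = $time;
         n_words = n_words + 1;
      end
   end

   // Testbench calls this before each new packet.
   task start_packet;
      begin
         n_words = 0;
      end
   endtask

   // Testbench calls this per word once the packet is complete, passing the
   // expected value from the reference model.  The stored time is printed
   // in this module's time units.
   task check_word;
      input integer  idx;
      input [DW-1:0] exp_data;
      begin
         if (idx < n_words && act_data[idx] !== exp_data)
            $display("ERROR: at time %0d, signal %0s is %0h, but is expected to be %0h",
                     act_time[idx], SIG, act_data[idx], exp_data);
      end
   endtask
endmodule
```

The design choice is simply to pay a little memory per word so that the
time and signal are still known when the mismatch is finally detected.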

Conclusion:

  - The location of the bug was always found and it was typically in
    the top 3-5 OnPoint "suspects".

  - It was also easy to distinguish between testbench bugs and RTL bugs
    based on the "suspects" returned.

To summarize, OnPoint did a great job of root cause analysis.  The tool
worked as advertised.

I hope your DeepChip readers will find my assessment useful.

    - [ The Man In The Iron Mask ]

Copyright 1991-2024 John Cooley. All Rights Reserved.