Editor's Note:  With the war in Iraq going on and the new SARS virus
 scaring everyone who reads about it, want to know what made front page
 news of my local small town newspaper?  Two baby pigs escaped from a
 farm in the neighboring town of Medway.  It wasn't news at first,
 especially since Baby Pig No. 1 -- they have no names -- was recaptured
 almost immediately.  What made it news was that Baby Pig No. 2 was
 running loose up and down Rt. 126 (a main road in Medway) causing major
 traffic problems.  The piglet was literally running into the street in
 front of oncoming cars and then suddenly running away.  The cops came
 and tried to catch Baby Pig No. 2, but the problem is that baby pigs
 are very smart and very quick on their feet.  Cops aren't.  Eventually
the cops got out some giant butterfly nets.  Rt. 126 became this weird
day-long scene of the police with these giant butterfly nets walking up
 and down finding, chasing, then losing, then finding, then chasing, then
 losing Baby Pig No. 2.  Needless to say, this didn't help the traffic
 problem on Rt. 126 at all.  That was on Saturday.  Since Monday, no one
 has seen the fugitive Baby Pig No. 2 anywhere, but we at the Holliston
 Poor Farm have been secretly cheering for him.  Go, little piggy, go!  :)

                                             - John Cooley
                                               the ESNUG guy

http://www.metrowestdailynews.com/news/local_regional/medw_pigs03292003.htm


( ESNUG 410 Subjects ) ------------------------------------------ [04/02/03]

 Item  1: A First Customer Look At Prolific's New ProTiming / PrimeTime ECOs
 Item  2: ( ESNUG 409 #8 ) Magma, Cadence/Celestry ClockWise & Useful Skew
 Item  3: ( ESNUG 404 #15 ) Async Resets Are A Nightmare At 0.13 um & Below
 Item  4: ( ESNUG 407 #16 ) Mentor Changing The Eldo Licensing Mechanism
 Item  5: What Are The Speed & Power Advantages of Going From 0.18 To 0.13?
 Item  6: ( ESNUG 409 #6 ) Novas Calls Shenanigans On Veritools Potshots
 Item  7: Dan's First Place DVcon'03 Paper On How To Reduce Random Testing
 Item  8: Rajesh's DVCon'03 Paper On Free Ways To Speed Up Your Verilog Runs
 Item  9: ( ESNUG 408 #1 ) Why Astro-Xtalk & HSIM Results Will Always Differ
 Item 10: Is Vera/VCS Performance Better Than Vera/NC-Sim Performance?
 Item 11: ( ESNUG 393 #4 ) Only Fools & Idiots Use #1 Delays In Their Code

 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com


( ESNUG 410 Item 1 ) -------------------------------------------- [04/02/03]

From: David Parker <user=parkerd  domain=lsil jot palm>
Subject: A First Customer Look At Prolific's New ProTiming / PrimeTime ECOs

Hi John,

Here is a run-down of a recent tool evaluation.  Thought your readers might
be interested.

ProTiming is the new cell resizer point tool from Prolific.  ProTiming bolts
onto PrimeTime and uses your existing STA setup as a starting point for the
optimization (more on this later).  Its resizing optimizations fall into
two categories:

     1) a gate is resized to an existing cell in your library,
        e.g., a NR2X1 is resized to a NR2X2;

and

     2) a gate is resized to an in-between drive size,
        e.g., a NR2X1 is resized to a NR2X1.6.

Since the in-between cells do not exist in the lib prior to optimization,
the tool creates an estimated timing snapshot of the cell, enabling
ProTiming to update the timing as the optimizations occur.
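Prolific hasn't disclosed how that estimated timing snapshot is built.  As a
rough illustration only (this is my own sketch, not their method), you could
interpolate an in-between drive's delay from the two characterized drives on
either side of it, assuming delay under a fixed load scales roughly as
1/drive:

```python
# Rough sketch: estimate the delay of a hypothetical in-between drive
# (e.g. an NR2X1.6) from two characterized drives (NR2X1, NR2X2).
# This is NOT Prolific's actual method, just simple interpolation in
# 1/drive, since gate delay under a fixed load scales roughly with
# 1/drive strength.  All delay numbers are invented.

def estimate_delay(d_x1, d_x2, drive):
    """Interpolate delay (ns) for a drive strength between 1 and 2."""
    # model delay ~ a + b/drive; solve a, b from the two known points
    b = (d_x1 - d_x2) / (1.0 / 1 - 1.0 / 2)
    a = d_x1 - b / 1.0
    return a + b / drive

# Hypothetical characterized delays for NR2X1 and NR2X2 under one load:
print(round(estimate_delay(0.20, 0.12, 1.6), 4))   # -> 0.14
```

A real estimator would of course account for slew, load, and per-arc timing,
but this is the flavor of the problem.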

In order to minimize the impact of this evaluation on my engineering staff,
I provided Prolific the necessary files and let Prolific drive their
ProTiming tool.  During the evaluation process, I focused my efforts on
validating ProTiming's resulting output and approach (more on this later).
Since I have not driven the tool, I cannot comment on specific software
issues such as usability, bugs, GUI, run times, etc.  

The Test Design & Methodology

The test case we chose for this evaluation was an ARM processor implemented
in LSI Logic's GflxP 0.13 um standard cell library.  This design contains
~40k cell instances and was taken through LSI's standard timing closure flow
prior to entering the evaluation process.  Our timing closure flow looks
like the following:

   1. Synthesis with Design Compiler.
   2. Timing-driven placement with MPS.  (MPS is LSI's internal placer.)
   3. Clock tree insertion with the Synopsys (Avanti) GCTS clock tree
      synthesis tool.
   4. Timing-driven physical optimization with MRS.  (MRS is LSI's internal
      physical optimization tool which performs gate resizing, buffer
      insertion, logic restructuring, etc.)
   5. Detail route with Synopsys (Avanti) Apollo.
   6. Extract SPEF using Synopsys (Avanti) Star-RCXT.
   7. Generate post-layout Verilog.
   8. Delay prediction with lsidelay.  (lsidelay is LSI's internal delay
      calculator.)
   9. Static timing analysis in PrimeTime with lsidelay-generated SDF.

The post-layout version of the design was then used as the starting point
for the ProTiming optimization.


The Eval
--------

September 19, 2002

I provided a tarball to Prolific containing the data files required to drive
their ProTiming tool.  The tarball contained:  post-layout Verilog, worst
case SDF file, Synopsys design constraints, SPEF file, .lib and .db for the
standard cell libraries, and the PrimeTime script for this design.

October 4, 2002

I received their first set of results and started digging in.  The results
sent by Prolific consisted of two key pieces: a change.tcl file and an
estimated.db file.  Their change file contained all of the resizing ECOs as
determined by ProTiming.  The change file is written to take advantage of
the "size_cell" function in PrimeTime, so one line of the change file may
look like "size_cell {instance_name} {library/cell}".  Their estimated.db
contains the cell info (timing, area, etc., your standard .db stuff) for
the in-between cells.  Using these two files, the ECO changes can be
evaluated in PrimeTime.

The thing to keep in mind here is that this ECO evaluation is mixing the
interconnect context (SPEF file) from the original post-layout result with
the ProTiming ECOs (cell swaps).  To achieve the predicted performance as
shown by PrimeTime, the physical ECO must maintain a similar interconnect
context.

Using PrimeTime, I evaluated the first round results and was able to
replicate the 13% speed increase Prolific reported.  Digging a little
deeper, I determined that a number of the critical/near-critical paths in
the design were not properly optimized during the original timing closure
effort.  By moving to a PrimeTime setup that used ideal clocks and an
uncertainty budget, I was able to take the easy pickings off the table (the
poorly optimized paths were associated with clock gating structures).
Since ProTiming does not alter the clock structures, this seemed like a
reasonable approach throughout the remainder of the evaluation.
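As a concrete (made-up) illustration of what such a change file looks like,
and how one might sanity-check it before applying it in PrimeTime; the
instance names, library names, and cell names below are all invented:

```python
# The change.tcl from ProTiming is just a list of PrimeTime size_cell
# commands.  A quick sanity check before applying it is to tally what
# the ECO actually touches.  All names below are made up.

from collections import Counter
import re

change_tcl = """\
size_cell {u_core/u1234} {gflxp_est/NR2X1P6}
size_cell {u_core/u5678} {gflxp/NR2X2}
size_cell {u_fetch/u42}  {gflxp_est/INVX2P4}
"""

# each line: size_cell {instance} {library/cell}
swaps = re.findall(r'size_cell\s+\{(\S+)\}\s+\{(\S+)\}', change_tcl)

# how many swaps target estimated in-between cells vs. real library cells
by_lib = Counter(lib_cell.split('/')[0] for _, lib_cell in swaps)
print(len(swaps), dict(by_lib))   # -> 3 {'gflxp_est': 2, 'gflxp': 1}
```

The split between the estimated library and the real library matters because
every swap into an estimated cell is a cell you later have to build and
characterize for real.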

October 7, 2002

I spoke with Prolific about moving from the PrimeTime setup that used
propagated clocks to a PrimeTime setup that used ideal clocks and an
uncertainty budget.  Additionally, we discussed relaxing several of the
input and output constraints to improve optimization results.  This was
particularly important for input and output paths with shallow logic
depths.  For these paths, the majority of the cycle time was tied up in
the input or output delay constraint.  I agreed to relax these delay
constraints by 150 psec.

October 9, 2002

I received a second set of results from Prolific.  The speed-up was now
10% as reported by Prolific (the optimizer had been set up to run with a
targeted goal of 10%).  As I reviewed the results, it became clear that
delay prediction differences between PrimeTime's timing engine and LSI's
timing engine were influencing the results.  At the time, the ProTiming
tool was being run using the lsidelay generated SDF and the provided SPEF
file.  For each gate in the design, the SDF is used to determine the
original gate delay.  Once a gate is resized, PrimeTime recalculates the
gate delay using the SPEF network for the net.  In my case, this led to
a situation where the gate delays in the optimized design were generated
using a mixed delay prediction; PrimeTime for the resized gates, lsidelay
for the gates that had not been resized.  I thought this would confuse the
comparison results so I asked Prolific to re-run optimization using only
the SPEF file (this request was made 10/28/2002).  All pre-optimization
to post-optimization comparisons from here forward would be made using the
PrimeTime delay prediction.

November 5, 2002

I received a third set of results from Prolific.  The speed-up as reported
by Prolific was 11.9% (the optimizer had been set up to run with no targeted
performance goal).  I reviewed the results in PrimeTime to verify the
speeds.  While I was able to replicate the number Prolific had used to
calculate the performance boost, I was not satisfied with the result.  The
speed increase had been calculated counting all types of paths; input to
register, register to register and register to output.  Obviously, I wanted
the input and output paths to be optimized, but the performance of the
processor block had to be calculated considering only register to register
paths.  When I recalculated the performance benefit using this criterion,
the speed of the processor had improved by only 4.6%.  Back to the drawing
board.
I contacted Prolific to discuss the discrepancy in performance calculations
(call was made 11/07/2002).  During the discussion, I agreed to take another
pass through ProTiming using the current optimization result as the starting
point.  Additionally, I agreed to zeroing out the input and output delays so
that the ProTiming tool could focus on the flop-to-flop paths.
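The pitfall above is easy to reproduce with a toy calculation (all delay
numbers invented): averaging a speed-up over every path group, including the
I/O paths, inflates the number relative to the flop-to-flop figure that
actually sets the block's clock rate.

```python
# Toy illustration of the performance-calculation pitfall: averaging a
# speed-up over all path groups (including I/O paths) inflates the
# number relative to the reg-to-reg paths that set the clock rate.
# All delays (ns) are invented for this sketch.

def speedup_pct(before_ns, after_ns):
    return (before_ns / after_ns - 1.0) * 100.0

# Worst path delay per path group, (before ECO, after ECO):
paths = {
    "in2reg":  (4.0, 3.2),
    "reg2reg": (4.0, 3.82),
    "reg2out": (4.0, 3.3),
}

all_groups = [speedup_pct(b, a) for b, a in paths.values()]
print(round(sum(all_groups) / len(all_groups), 1))  # -> 17.0 (misleading)
print(round(speedup_pct(*paths["reg2reg"]), 1))     # -> 4.7 (what matters)
```

The I/O paths were easy wins (and worth having), but the block can only be
clocked as fast as its worst register-to-register path allows.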

November 19, 2002

I received a fourth set of results from Prolific.  The speed-up was reported
to be 9.6%.  Using PrimeTime, I reviewed the results and was able to verify
that the input/output path timing was nearly identical to the 11/05/02
result.  I also verified that the flop-to-flop performance had increased by
9.6%.   At this point, I was satisfied that the result warranted further
dissection.

In order to feel reasonably confident about the result, I wanted to answer
three questions:

  1) was the timing estimation for the in-between drives solid?
  2) was the resulting area reasonable?
  3) did the cell optimizations make sense?

To answer question one, I sent our internal library development team a copy
of an uncompiled version of the estimated.db file.  (I had asked for this
prior to receiving the 11/19/02 result.)  I asked the library team to review
the timing estimation for the in-between drives and compare it to what we
would get if we were to build and characterize the cell ourselves.  After
reviewing the data, it was determined that the timing estimation method was
reasonably accurate and tended to be slightly conservative for most (but not
all) cells.

To answer question two, the area penalty was determined to be 5.1% using
the Synopsys reported area.

Finally, several spreadsheets were generated with Excel to break down the
cell swaps that ProTiming was implementing.  One would expect that the
critical paths in the design would be composed primarily of simple logic
gates, i.e., ND2, NR2, INV, BUF, etc.  It would seem logical, then, that
the majority of ECO operations should be ECOs of these simple gate types.
The spreadsheets showed that approximately 70% of the cell swaps were
performed on simple logic gates.  This was one particular example of a few
thought experiments that showed the cell swaps performed by ProTiming to
be reasonable.

All in all, the ProTiming results seemed reasonable after further scrutiny.
This meant that the true speed-up boiled down to one's ability to
successfully work the changes into the P&R flow (isn't this often the
case?).

January 5, 2003

I asked Prolific to re-run the optimization to reduce the area penalty.
They suggested an approach that would limit the maximum area (Synopsys
area attribute) of the in-between drives.  We decided that two runs of
ProTiming would be performed, one with a maximum area of 50 and one with
a maximum area of 100.

January 23, 2003

Prolific sent an email with the results from the two additional trials.
The first run (max. area 50) provided an 8.9% speed increase for a 3.6%
area penalty.  The second run (max. area 100) provided a 9.3% speed-up
for a 4.2% area penalty.

February 20, 2003

Finally had enough time on my hands to dig into the results reported in
the 01/23/03 email.  Unpacked the tarball, fired up PrimeTime and analyzed
the results.  I was able to reproduce the reported timing improvements
and area increases.


My overall observations in the ProTiming eval:

  1. I chose to skip the physical ECO process for now since this would
     require a build of the cells to generate physical views.
  2. Since ProTiming does rely on an ECO process, it has the same
     pitfalls as any other ECO process.
  3. With an ECO tool like this, one must plan ahead to reserve enough
     room to facilitate a successful ECO.  This tool will have a
     difficult time improving timing on a very highly utilized design.
  4. I like the fact that ProTiming bolts onto PrimeTime and uses a
     standard STA setup as a starting point.  Those fluent in PrimeTime
     likely would come up to speed on ProTiming in short order.
  5. Since ProTiming only does resizing optimizations, it can provide
     only marginal benefit in some cases.  For example, it will have
     minimal impact on paths hindered by bad clock skews and/or poorly
     structured logic.
  6. If PrimeTime is not your delay prediction engine, the correlation
     issue must be addressed.
  7. Whether it be in-house capability or external capability, you need
     to have a means of building and characterizing the new cells
     (in-between drives).
  8. I don't have a good handle on ProTiming's capacity issues since
     the ARM processor block was on the small side.
  9. You can do a lot of exploratory work in PrimeTime before the physical
     database is touched.

John, obviously I can't comment on any of the software-specific issues since
I did not drive the tool, but I have spent a fair amount of time reviewing
ProTiming's optimization capabilities and am generally pleased with their
test case results.

    - David Parker
      LSI Logic                                  Bloomington, MN


( ESNUG 410 Item 2 ) -------------------------------------------- [04/02/03]

Subject: ( ESNUG 409 #8 ) Magma, Cadence/Celestry ClockWise & Useful Skew

> In my opinion, useful skew is the way to go!  My question is: how many
> people are using useful skew in their design flow today?  Am I one of few
> using useful skew or one of the many?
>
>     - Simon Matthews
>       Paxonet                                    Fremont, CA


From: Jack Fishburn <soldier=jfishburn  army=ieee.org>
To: Simon Matthews <cook=simon  kitchen=paxonet taut fawn>

Hi Simon,

I read your posting in ESNUG about useful skew.  You might want to read

            http://courses.ece.uiuc.edu/ece482/Sup/Clock3/

which is my paper "Clock Skew Optimization" in IEEE Transactions On
Computers which appeared back in 1990.  Wayne Dai's student, Joe Xi, then
did a PhD on clock distribution networks for useful skew, and his program
became "Clockwise" at Ultima, which was bought by Celestry, which was bought
by Cadence.  Then others, such as Magma, also introduced useful skew tools.

The one thing I missed in my paper was naming it something like "useful
skew", as a result of which people kept thinking I was somehow minimizing
clock skew.  But I'm pleased that it is catching on.  Back in 1990 people
would look at me like I was crazy.

    - Jack Fishburn
      ex-Agere (Bell Labs) and looking           Murray Hill, NJ

         ----    ----    ----    ----    ----    ----   ----

From: Simon Matthews <cook=simon  kitchen=paxonet taut fawn>
To: Jack Fishburn <soldier=jfishburn  army=ieee.org>

Hi, Jack,

I was aware that Ultima had a solution for useful clock skew.  However, as
a point tool, I viewed it as something that only a few early adopters would
use.

Magma's useful-skew feature gives excellent results, yet, from discussions
with other Magma users, I know that some users disable it!  I believe they
disable it because they really don't understand the concept, or because
they have learned how to minimize clock skew and they don't want that
hard-earned skill to become redundant.

One person, though, did express the idea that useful skew could make on-chip
timing variations worse.  Do you have any thoughts on this?

    - Simon Matthews
      Paxonet                                    Fremont, CA

         ----    ----    ----    ----    ----    ----   ----

From: Jack Fishburn <soldier=jfishburn  army=ieee.org>
To: Simon Matthews <cook=simon  kitchen=paxonet taut fawn>

Hi, Simon,

All transistors and wires have variations in their properties that cause
timing variations.  There's no reason that a useful-skew configuration will
have more, or less, timing variation than a zero-skew configuration.

However, useful-skew does give additional degrees of design freedom that can
be useful to avoid failure due to timing variation.  In the example you gave
in the ESNUG posting, the short path is close to giving a hold violation if
zero skew is used:

                A                      B                      C
             -------                -------                -------
          ---|D   Q|-- slow logic --|D   Q|-- fast logic --|D   Q|---
             |     |                |     |                |     |
             |     |                |     |                |     |
   clock ----|>    |   --|>o-|>o----|>    |            ----|>    |
           | -------   |            -------           |    -------
           |           |                              |
           --------------------------------------------


A little bit of unintended clock delay to the destination FF "C" will cause
the path to fail by double clocking.  But by supplying extra clock delay to
the middle FF "B", as in your example, the short path is given that much
extra margin against the failure.

Another way to think of it is that useful skew can be used to give extra
safety margin at the same clock period, or to lower the clock period with
the same safety margin.  Or you can dial in any combination in between.
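To put rough numbers on Jack's three-flop example (all values invented; a
real STA run would use the actual library and net delays), here is the
setup/hold bookkeeping:

```python
# Numeric sketch of the three-flop example above.  Delaying the clock
# to the middle flop B adds hold margin on the fast B->C path and setup
# margin on the slow A->B path.  All numbers (ns) are invented.

T      = 5.0    # clock period
t_cq   = 0.3    # clock-to-Q delay
t_su   = 0.2    # setup time
t_hold = 0.1    # hold time
slow   = 4.4    # A->B combinational delay
fast   = 0.15   # B->C combinational delay

def margins(skew_b, skew_c=0.0):
    # setup slack on A->B: data must beat B's (skewed) next capture edge
    setup_ab = (T + skew_b) - (t_cq + slow + t_su)
    # hold slack on B->C: data launched from the skewed B must not race
    # through to C's same-edge capture
    hold_bc = (t_cq + fast + skew_b) - (t_hold + skew_c)
    return round(setup_ab, 2), round(hold_bc, 2)

# 0.3 ns of unintended extra clock delay at C eats the hold margin:
print(margins(0.0, 0.3))   # -> (0.1, 0.05)
# 0.5 ns of useful skew at B restores it and helps A->B setup, too:
print(margins(0.5, 0.3))   # -> (0.6, 0.55)
```

This is exactly the trade Jack describes: the intentional skew at B buys
margin on both the slow and the fast path at the same clock period.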

At any rate the static timing analyzer knows about the timing bounds in
both the data & clock networks, and will take these into account in setting
the intentional clock skew.  At the end of the day, the intentional skew
circuit will be safer than the zero skew circuit.  But I am painfully aware
that people who are scared will tend to reject any new technique that they
don't understand.

    - Jack Fishburn
      ex-Agere (Bell Labs) and looking           Murray Hill, NJ


( ESNUG 410 Item 3 ) -------------------------------------------- [04/02/03]

Subject: ( ESNUG 404 #15 ) Async Resets Are A Nightmare At 0.13 um & Below

> I would like to hear from anyone with experience in 0.13 um and below on
> the reset synchronous vs. asynchronous issue.  In the past this was purely
> a religious debate but now with signal integrity issues, momentum toward
> synchronous global resets is building.  Several articles have been
> published, but I have not heard from anyone with a horror story about
> glitches causing unwanted resets of flops.  What about the output of the
> signal integrity tools?  Have glitches above threshold been reported?  How
> many?  Bottom line: is this a manageable problem in today's tools or
> something one should just avoid altogether?  Does anyone recommend
> re-writing legacy code?
>
>     - John Gyurek
>       Thomson Consumer Electronics               Indianapolis, IN


From: Prasad Mantri <ship=mantri  harbor=bww lot bomb>

Hi John,

The issue is really not glitches in silicon reset logic.  Most design flows
require noise analysis to be done to fix the noise issues and the timing
pushout due to noise.  The issue is that when you have an asynchronous
reset, it has to meet timing to reset the part correctly.

Say you have a completely asynchronous reset on the chip which is taken
from a pin and drives the asynchronous reset of flip-flops.  If the reset
is not deasserted with sufficient margin with respect to the clock, some
flip-flops will reset in the current clock cycle while others will not.
This will prevent the part from coming out of reset correctly.

There are two kinds of reset schemes followed in designs today.  In the
first, there are no asynchronous reset (or set) flip-flops in the design;
this is the synchronous reset scheme that John Gyurek is referring to.  In
the second, the design does have asynchronous reset flip-flops.  Most chips
even in the second scheme synchronize the reset deassertion as soon as it
comes into the chip.  What that does is give a full clock cycle for the
deassert to take effect.  If you do not do this, then you need the reset
tree to have timing as tight as the clock tree, thus requiring a Clock Tree
Synthesis (CTS) run for the reset.  In a synchronous reset scheme, the
reset is treated like any other logic signal.

The advantage of synchronous reset is that in DFT no special care has to be
taken for the reset.  In asynchronous schemes you have to disable the reset
during scan shifting.  The disadvantage of synchronous reset is that you
have to have a clock to reset the part.

Gyurek asks about legacy code: fix it only if it breaks scan or timing.
I have seen PCI and USB IP cores in which the asynchronous reset pins of
flip-flops were driven by other flip-flops, and those needed to get fixed.

    - Prasad Mantri                              San Jose, CA

         ----    ----    ----    ----    ----    ----   ----

From: Paul Rodman <person=rodman  company=reshape caught guam>

Hi John,

I'd like to add my two cents on some physical design issues looming larger
with the use of asynchronous resets.

As you know, at .18 um and below, crosstalk is becoming quite significant.
For normal synchronous signals, timing "push-out" can occur due to a
glitch propagating down the logic cone to the input of a flip-flop, if it
happens to arrive right at the setup time for the flop.

But... for asynchronous signals a glitch is going to destroy state!
Turning down the clock will not help.  Also consider this: the window of
vulnerability is the entire cycle, rather than the setup time of the flop.
Hence the statistics are:

               cycle time / flop setup time = big number

And this is that big number of times worse than the timing push-out problem.
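With invented but plausible numbers, the ratio looks like this:

```python
# Back-of-envelope version of the ratio above, with invented numbers:
# a glitch on an async reset net is dangerous for the entire cycle,
# not just during the flop's setup window.

cycle_ns = 5.0     # 200 MHz clock
setup_ns = 0.15    # flop setup window
print(round(cycle_ns / setup_ns, 1))   # -> 33.3
```

So for this hypothetical clock, an async reset net is on the order of 30x
more exposed to a given aggressor than a synchronous data net.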

The wholesale use of asynchronous resets creates a huge amount of wire that
is vulnerable to glitch pickup.  If you have multiple clock domains on a
chip, and these domains are asynchronous to each other and placed and routed
together...  well...  think about it.  The clocks are going to riffle with
respect to each other at light-speed and you are going to fire each clock
domain at every possible random alignment to the other domain.  This
happens every few seconds.  Now, build a few hundred thousand systems with
these chips, and subject them to different voltages and temperatures.

If you have a coupling case, you will find it.  The hard way.  :( 

So, how to safeguard against this if you really need asynchronous resets?
As Jay Pragasam pointed out in ESNUG 396 #1, the typical novice boo-boo with
resets is forgetting that they need to be considered timed signals in order
to have the chip exit reset cleanly.

I'll go one step further: logic cones feeding asynchronous resets must have
every net glitch-free (or very assuredly non-propagating) for any worst-case
potential coupling event.  The only practical way to achieve this (and not
lose any sleep at night) is to aggressively buffer such nets.

This is what we do in our flow here at ReShape.  We allow the use of
asynchronous resets, but protect and check them.  This comes at some
additional cost in routing (potentially chip area), but we feel it is
not something to be skimped on.

You can't really use switch windowing to help reduce the number of cases
detected either, even on a chip with only one clock domain.  The reason is
that none of the tools I've seen model the variation in process for silicon
and metal independently.  The delay of aggressor nets is a function of both
silicon and metal delay.

This uncertainty in the relative delay of aggressor nets with respect to
each other means that as you move through the cycle the switch windows must
become broader and broader.  Ironically, for a long-tick ASIC with lots of
gates, switch windows have almost no use.

And of course if you do have two or more async clocks on the chip routed
together, you are already unable to use windows.


As for reset fanout, even with synchronous resets at high clock rates it is
not trivial to fan out to hundreds of thousands of flops in one cycle.

Since there is usually no latency constraint on reset, a solution can be to
do some "synchronous fanout" of the reset tree.  What you do is make the top
two or three levels of the fanout tree out of registers rather than buffers.
Thus, a couple of levels down the tree you have enough flops that each block
in a hierarchical design (or quadrant in a flat design) has a local flop
driving the rest of the tree.

The huge one-cycle fanout problem is now reduced to a few tens of thousands
of flops.
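The arithmetic behind that reduction (total flop count, branching factor,
and number of register levels are all invented here) is simply:

```python
# Sketch of the "synchronous fanout" arithmetic above.  If the top
# couple of levels of the reset tree are registers rather than buffers,
# each local reset flop only has to reach
#     total_flops / branching**register_levels
# sinks in one cycle.  All numbers below are invented.

def local_fanout(total_flops, branching, register_levels):
    return total_flops // branching ** register_levels

# 400k flops, 4 registered branches per level, two register levels:
print(local_fanout(400_000, 4, 2))   # -> 25000 flops per local reset flop
```

Since the reset usually has no latency constraint, the extra pipeline
latency these register levels add is harmless.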


In the world of .13 um and below, it's worth the effort to wean oneself
off of wholesale asynchronous reset methodologies.

    - Paul Rodman
      ReShape, Inc.                              Mountain View, CA


( ESNUG 410 Item 4 ) -------------------------------------------- [04/02/03]

Subject: ( ESNUG 407 #16 ) Mentor Changing The Eldo Licensing Mechanism

> We've been using Mentor's electrical simulator "Eldo" for a long time now.
> Until recently (version 5.7), a user could run several Eldo simulation
> jobs simultaneously on a given machine using a single key - one key per
> (user+machine).  I call this a user-based licensing mechanism.
>
> In the newer versions of the Eldo simulator (version 5.8), each simulation
> job checks-out a key, even if it is the same user on the same hostid.
> I call this a job-based licensing mechanism.
>
> This change is very important as this leads to buying as many licenses as
> you run jobs in parallel compared to the previous mode where you needed
> one license per user of the tool.  The first licensing mode allowed users
> to take advantage of multiprocessor workstations.
>
> The price of the simulator did not change.
>
> I have several questions now:
>
>   - Is this a license terms violation?
>   - How do other simulators (HSPICE, Spectre, etc.) behave?  User-based
>     or job-based?
>   - Is Eldo really faster than hspice?
>
> Thank you for your help.
>
>     - [ I Wear My Sunglasses At Night ]


From: [ Do You Know The Muffin Man? ]

Hi John,

Please keep me anon in case you publish my answer.

Q: Is this a license violation?

A: It may or may not be.  That depends on the signed license agreement the
company has.

Q: How do other simulators behave?

A: All other licenses I have encountered are job-based (to stick with the
definitions above).  But Mentor may offer something like a multiprocessor
option.  I know for sure that it is available and works for Calibre.  If
you use between 2 and 6 CPUs on one machine, then it only takes 2 licenses.

Mentor has a table for Calibre showing how many CPUs translate into how
many required licenses if you want to use multiple processors on a single
machine.  If you use a load sharing system like LSF, that might get you
around buying additional licenses.  We use that and it helps us a lot.

Q: Is Eldo really faster than HSPICE?

A: I can't comment on Eldo; we don't have it.

    - [ Do You Know The Muffin Man? ]

         ----    ----    ----    ----    ----    ----   ----

From: [ Intel Inside ]

John,

Please keep my name anon (I am happy to entertain questions)

* License: User-based or job-based:

Unfortunately, all the simulators I have used (including Eldo, HSPICE and
Spectre) use the job-based licensing scheme, even for the same user on the
same machine.  Job-based has been the adopted method since the move from
node-locked to floating (network) licensing.  As a user, I wish that a
single license token were taken when the same user runs any number of jobs
on the same machine, with all the jobs submitted from one machine.  This
may be a fair compromise: it prevents the abuse of using one user account
to run many jobs, while allowing that user the freedom to use the h/w
efficiently without the rip-off from (ALL) CAD vendors.  I am looking
forward to seeing some action taken to wake the user community in order to
change the current scheme and make it more equitable and fair to both
parties.

* Speed of Eldo, Spectre, Hspice simulators:

It is hard to make a blanket statement that one simulator is X times faster
than another.  There are many factors that control the speed of a simulator
and the speed/accuracy trade-off.  In general, I would say that for CMOS
design I found both Eldo and Spectre to be significantly faster than HSPICE
at the equivalent accuracy setting(!).  In addition, I experienced even
faster speeds with Eldo, especially for large circuits with proportionally
large digital parts.  Eldo has efficient techniques that take advantage of
the nature of the circuit and the signal flow without sacrificing accuracy.

    - [ Intel Inside ]

         ----    ----    ----    ----    ----    ----   ----

From: David Norris <wolf=david.norris  wolfpack=legerity thought dawn>

Hi, John,

Yesterday we met with our Mentor AE and he described a new feature in Eldo
6.0, .mprun.  This feature allows Eldo to submit multiple jobs from the
same deck and only use one license.  .mprun also can use LSF to access
queues that have been set up.  So, it could be that Eldo's old mechanism
for job submission has changed, but I believe that the principle of
multiple job submission with one license will still exist.

At this point we have just been introduced to this feature, so we don't
have many details.

    - David Norris
      Legerity                                   Austin, TX


( ESNUG 410 Item 5 ) -------------------------------------------- [04/02/03]

From: Nir Sever <king=nir  castle=zoran.co.il>
Subject: What Are The Speed & Power Advantages of Going From 0.18 To 0.13?

Hi John,

I would like to get your readers' "rules of thumb" as to what speed
increase and power savings one might expect when porting a design from
a 0.18 um process to 0.13 um, assuming the same fab, same process type
and same type of libraries.

    - Nir Sever
      Zoran                                      Israel


( ESNUG 410 Item 6 ) -------------------------------------------- [04/02/03]

Subject: ( ESNUG 409 #6 ) Novas Calls Shenanigans On Veritools Potshots

> Undertow Suite can do exactly what Novas does and for less than 1/4 the
> cost.  What a totally unnecessary and major financial burden it is to have
> to purchase tools for every engineering team member that cost $25,000+,
> just in order to trace signals back on your RTL or gate designs.
>
>     - Bob Schopmeyer
>       Veritools, Inc.


From: Lorie Bowlby <director=lorie  orchestra=novas blot hon>

Hi, John,

Bob Schopmeyer of Veritools once again is quoted in relation to Novas.  Now
in the interest of free speech, we are understanding when another EDA vendor
wants to take potshots at our company, but I have a hard time with the fact
that he'd use ESNUG to disseminate erroneous information.  Bob is playing up
the fallacy that Debussy is just a waveform viewer, which at that basic
level, competes with his Undertow.  Okay, we can handle that - it's a sales
tactic.  But the fact remains that Debussy goes for around $5k TBL, not the
$25k Bob claims.  Last year we announced a new product, Verdi, the first
behavior-based debug system; it includes everything Debussy has and much
more, and it lists at $14k.

In these times when our customers are struggling with tough decisions as to
what tools to buy with their diminished budgets, this kind of disinformation
is not helpful to the customer.  So my question is: do you have any
guidelines for this type of situation?  My president, Scott Sandler, thinks
I should let Bob's comment go since our customers know the truth.  But Bob's
lies about our pricing bother me.

    - Lorie Bowlby
      Novas Software                             San Jose, CA


( ESNUG 410 Item 7 ) -------------------------------------------- [04/02/03]

From: Dan Joyce <captain=dan.joyce  ship=hp naught pomme>
Subject: Dan's First Place DVcon'03 Paper On How To Reduce Random Testing

Hi, John,

As per your request, here's a copy of my DVcon'03 paper.  I could have
easily titled it "Audit Your Design to Avoid the Random Testing Nightmare".
Basically it demonstrates a way to formalize the approach to corner case
testing.  The paper describes an audit of a design that tries to assess its
risk of corner case bugs.

Essentially, it'll teach your readers how to look at their design's
architecture to cut back the amount of random testing needed to verify it.

I've included my slides, too.  I'd love to see what your readers think
about this approach we've developed here at HP.

    - Dan Joyce
      Hewlett-Packard                            Austin, TX


  [ Editor's Note: Dan's paper is #42 of DeepChip Downloads  - John ]


( ESNUG 410 Item 8 ) -------------------------------------------- [04/02/03]

From: Rajesh Bawankule <the_one=rbawanku  the_many=cisco pot yawn>
Subject: Rajesh's DVCon'03 Paper On Free Ways To Speed Up Your Verilog Runs

Hi John,

Here's my paper & slides titled "Speed Up Verilog Simulations By 10-100X
Without Spending A Penny" from DVCon 2003.  My presentation illustrates
a number of ways you can speed up simulation, without spending anything
on a faster simulator, workstation, hardware accelerator, or emulator.
I focus only on already existing simulator features like profiling, 
optimized compilations, 2-state simulations, adaptive PLIs etc.  Included
are some case studies from my previous projects about typical bottlenecks
in simulation.

Nothing is free.  My tricks won't work like magic.  You need to apply them
step by step, making sure that your existing test benches don't go
haywire!   :)

I hope that ESNUG readers find it useful.

    - Rajesh Bawankule
      Cisco Systems                              San Jose, CA


  [ Editor's Note: Rajesh's paper is #43 of DeepChip Downloads  - John ]


( ESNUG 410 Item 9 ) -------------------------------------------- [04/02/03]

Subject: ( ESNUG 408 #1 ) Why Astro-Xtalk & HSIM Results Will Always Differ

> The HSIM simulations that we ran contributed to the decision that our
> noise model, although conservative, was too pessimistic.  We modified
> our noise model which improved our Astro results.
>
>     - Roger Boates
>       STMicroelectronics                         San Diego, CA


From: Li-Pen Yuan <actor=lipen  movie=synopsys ought alm>

Hi, John,

The difference in results Roger Boates reported in Astro-Xtalk and Nassda
HSIM comes from:

   1. Fundamental difference in nature between static and dynamic analysis.
      Astro-Xtalk determines which aggressors switch simultaneously based on
      the overlap of timing windows, obtained from the STA.  Due to its
      static nature, the timing window overlap is more "exclusive" than
      "inclusive."

      That is, when two aggressors have disjoint timing windows, we know for
      sure they will not be switching at the same time, discounting process
      variation effects, and therefore their contributions should be
      assessed separately.  However, that is very different from saying they
      will *always* be switching at the same time.  Static analysis only
      tells you such *possibility* exists.

      To prove it one way or another is technically feasible, but
      prohibitively expensive.  So we don't do it in Astro-Xtalk.  There's a
      trade-off between how much pessimism can be removed from static
      analysis vs. how much runtime hit user is willing to tolerate.  For
      static crosstalk analysis, we decided that reduction in pessimism via
      the use of timing windows is acceptable but to avoid anything more
      complicated.  An analogy is user-entered false path in today's STA vs.
      the concept of automatic false path detection, a popular research
      topic that wasn't embraced by the industry.

   2. Insufficient simulation cycles.  Roger didn't specify how many cycles
      he ran vs. how many different input patterns are possible.  It's
      possible that his simulated patterns didn't catch the worst-case
      behavior of the net where more than one aggressor switches
      simultaneously.  Looking at the long runtime incurred in simulation-
      based techniques, it's obvious that for implementation and sign-off
      static analysis is really the only feasible solution with reasonable
      turn-around time.

   3. Roger's runs lacked wire resistances.  Since wire resistance is not
      back-annotated for the sake of runtime in his experiment, interconnect
      delay and slew degradation are essentially discarded from his
      simulation.  It is then possible that his differences in signal
      arrival times for the three aggressors are large enough that they no
      longer switch at the same time (while in reality they would have, had
      the parasitics been fully annotated).

From the above it should be obvious that both static and dynamic analysis
are limited in applicable scenarios.  In fact, they're complementary.
Static analysis covers the worst-case scenarios and runs orders of magnitude
faster than dynamic analysis.  It however can be overly pessimistic and
results in Astro P&R overkilling the problem.

Dynamic analysis uncovers the true worst-case behavior, but at a very high
runtime cost (which limits its uses significantly.)

Roger's letter actually shows how difficult it is to get your static and
dynamic analysis to run under the same setting and with the same input data.
Here's a more popular validation procedure when a static crosstalk analysis
tool is used:

   a. A set of simple test circuits is designed by the evaluator.
   b. Both static and dynamic tools are run on the test circuits
        to produce results.
   c. The correlation between the two tools' results is measured, and the
        adoption decision is made based on this figure of merit.

Please note that the designer's "simple test circuits" are usually designed
in such a way that the differences in static & dynamic analysis tools are no
longer important.  That is, the worst-case scenario reported by a static
tool can be verified by running a dynamic tool with one pattern.  This
limited type of test therefore only evaluates the accuracy of crosstalk
modeling (including driver and interconnect), not the degree of pessimism
introduced by using timing windows.  That would instead require using
simulation traces to decide how many aggressors can be *actual* culprits in
real circuit operations.

    - Li-Pen Yuan
      Synopsys, Inc.                             Sunnyvale, CA


( ESNUG 410 Item 10 ) ------------------------------------------- [04/02/03]

From: David Sawey <prisoner=dsawey  jail=vitesse tot sawn>
Subject: Is Vera/VCS Performance Better Than Vera/NC-Sim Performance?

Hi John,

Our Synopsys AC is saying that we can improve our Vera simulation run times
by a factor of between 1.8X and 6X by using Vera/VCS instead of Vera/NC-Sim.
The reason is that VCS implements a direct kernel interface whereas Cadence
NC-Sim uses the slower PLI interface.  However, when we run our testcases,
we find that the two approaches are actually quite competitive.  What's
going on?  Are we missing out on some performance increases that everyone
else is getting?

    - David Sawey
      Vitesse Semiconductor Corp.                Richardson, TX


( ESNUG 410 Item 11 ) ------------------------------------------- [04/02/03]

Subject: ( ESNUG 393 #4 ) Only Fools & Idiots Use #1 Delays In Their Code

> In ESNUG 387 #16, Mark Warren claims that the following flip-flop coding
> style uses an unnecessary #1 delay:
>
>                         always @(posedge clk)
>                             a <= #1 b;
>
> If you do not include the #1 delay above, then you are relying on the
> intra-timetick event ordering rules of Verilog.  The problem is that
> the PLI interface does not guarantee these intra-timetick ordering
> rules.  Thus, if you use PLI to model circuit behavior, you may have
> latent race conditions.  (This applies to VMC models, too)
>
> In addition, using the #1 does in fact avoid races when you have code
> which may not always use non-blocking assignments for flops.  This can
> happen if a designer accidentally uses a blocking assignment, or if
> you are trying to do mixed gate+RTL simulations, or if you are using
> code not produced by your team - for instance compiled RAM models, 3rd
> party BIST controllers, IP cores, etc.
>
> Furthermore, if you eliminate all #1 delays on flops and then have a
> race condition, it is notoriously difficult to debug.  Just turning on 
> waveform dumping may cause the simulation to pass...
>
>    - Darren Jones
>      MIPS Technologies                          Mountain View, CA
>


From: Mark Curry <goldfish=mcurry  bowl=ti jot prom>

John,

Using #delays ANYWHERE in RTL code is not allowed in my coding style.   I
just went back and checked my code for the last three chip designs.  No
#delays anywhere.  Not a one.  And now since we're starting to code
synthesizable testbenches (for use in hardware acceleration/emulation), the
#delays are disappearing there, too.  My only #delay is at the top level
testbench for generating the clock.  That's it.
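
For readers who want to see what that looks like, here's a minimal sketch
(module and signal names are made up, not from Mark's actual code) of the
one place a #delay survives in this style:

```verilog
// A minimal sketch, assuming Mark's style: the ONLY #delay in the whole
// environment is the top-level testbench clock generator.
module tb_top;
  reg clk;

  initial clk = 1'b0;
  always #5 clk = ~clk;   // 10-time-unit period -- the lone #delay

  // DUT and synthesizable checkers are instantiated here, written with
  // no #delays anywhere inside them:
  //   my_chip dut ( .clk(clk), ... );
endmodule
```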

At best, #delays offer only a crutch for people to hold onto when viewing
waveforms.  It allows them to "see" the clk-Q delay of a flip-flop.

At worst, they can introduce real bugs in designs.

Somewhere in the middle - you're just creating more problems for folks down
the line trying to reuse your code, or use a different simulator.


A couple of counterpoints to Darren's arguments for #delays:

> 
>                        always @(posedge clk)
>                             a <= #1 b;
>
> If you do not include the #1 delay above, then you are relying on the
> intra-timetick event ordering rules of Verilog.  

These event ordering rules are well-defined rules that all simulators must
follow.  Of course I rely on them.  So do you.  Better to rely on explicit,
well-defined behaviors than on random ones which change with any change in
code or simulator.
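
As a concrete illustration of relying on those rules (a sketch, not Mark's
code), a pipeline built from plain nonblocking assignments is race-free
without any #1:

```verilog
// Sketch: two back-to-back flops with plain nonblocking assignments and
// no #1 delays.  The Verilog LRM requires all right-hand sides to be
// sampled before any nonblocking update is applied, so q2 always sees
// the *old* q1 -- regardless of which always block runs first.
module pipe2 (input clk, input d, output reg q1, output reg q2);
  always @(posedge clk) q1 <= d;
  always @(posedge clk) q2 <= q1;   // deterministic by the LRM; no #1 needed
endmodule
```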


> The problem is that the PLI interface does not guarantee these intra-
> timetick ordering rules.

Huh?  We use the PLI extensively, and have never had issues with the lack of
#1s.  Last time I looked, when driving Verilog objects from the PLI you're
given many options on when the PLI's "drive" takes place.  Effectively, we
had the C model drive our Verilog in the same style as our Verilog non-
blocking type assignments to clocked registers (without delays).


> In addition, using the #1 does in fact avoid races when you have code
> which may not always use non-blocking assignments for flops.

This is a bug.  Period.  See Cliff's paper in ESNUG 409 #10.  If it's your
code, fix it.  If it's in someone else's code, try to get them to fix it.
If not, then you're forced to work around it.  Which kind of leads to my
original point: if the original person had never used the delays and other
dangerous coding styles in the first place, it wouldn't cause problems later
on for someone reusing the code.


> Furthermore, if you eliminate all #1 delays on flops and then have a
> race condition, it is notoriously difficult to debug.  Just turning on 
> waveform dumping may cause the simulation to pass...

If you have code that fails when dumping is off, then passes when dumping is
on, it is because you're not following the blocking/non-blocking rules, and
should fix it.   It's not because of the lack of #1's.  The #1's are simply
poor work-arounds.

Our team has now followed the recommended blocking/non-blocking coding
styles for quite a few years.  Some still do use the #1 in non-blocking
assigns for flip-flops.  It's OK as long as they follow the blocking/non-
blocking rules, and don't mind seeing 1 nsec glitches in their sim dumps
(where signals from #1's and non-#1's come together through non-sequential
logic.)    

I'm still working on them though.  We'll convert them over yet!  #1's
really offer you nothing, and have the potential of causing so much grief.

We are able to simulate our designs in many sim environments including:
Verilog-XL, NC-SIM, ModelSim, and Axis XSIM (both software and hardware
acceleration).  Other teams take our designs and map them onto Quickturn
boxes.  All have no issues (with respect to this coding style).  We're able
to switch back and forth with just changes to run scripts.

    - Mark Curry
      Texas Instruments Broadband Access Group 

         ----    ----    ----    ----    ----    ----   ----

> In addition, using the #1 does in fact avoid races when you have code
> which may not always use non-blocking assignments for flops.  This can
> happen if a designer accidentally uses a blocking assignment, or if
> you are trying to do mixed gate+RTL simulations, or if you are using
> code not produced by your team - for instance compiled RAM models, 3rd
> party BIST controllers, IP cores, etc.


From: Cliff Cummings <dancer=cliffc  nightclub=sunburst-design rot mom>

Hi, John,

There are a few problems with Darren's statement.

  1) mixed blocking and nonblocking assignments are a problem with or
     without the #1 delay.  It is possible for a blocking-assignment flop
     to execute before the right-hand-side (RHS) of a nonblocking assignment
     is sampled, causing a race condition.  If there are flops modeled with
     blocking assignments in the design, they should be fixed, period.

  2) semi-correct on the mixed gate+RTL simulations, but not completely 
     correct.  Most gate-level register models have hold time requirements
     of less than 1 nsec, so the #1 will fix those problems.  It's not
     unusual for a RAM model to have a hold time requirement of greater
     than 1 nsec, in which case the #1 is not sufficient.
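
Cliff's point 1 can be made concrete with a short sketch (illustrative
code, not from any real design):

```verilog
// The race in point 1: a flop modeled with a blocking assignment.  Whether
// q2 samples the old or the new q1 depends on the order in which the
// simulator happens to evaluate the two always blocks -- and a #1 on the
// nonblocking assignment does not remove this nondeterminism.
always @(posedge clk) q1 = d;       // BAD: blocking assignment in a flop
always @(posedge clk) q2 <= q1;     // may see pre- or post-update q1
```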

The only place you need an intra-assignment delay included with a non-
blocking assignment for mixed gate+RTL delays is on the outputs of the RTL
model that could drive the inputs of a gate-level model with hold-time 
requirements.  You also need to verify that all of the gate-level inputs 
have hold time requirements of less than 1nsec, otherwise some of the RHS 
delays will need to be increased.  A semi-simple ifdef-ed block of code to 
drive all module outputs with the typical #1 delay could be added to the 
module to address this rare condition, otherwise, running regressions on 
large RTL designs would benefit from the accelerations described by Mark.
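
One way to read that suggestion, sketched with made-up macro and signal
names (the real block would cover every module output, not just one):

```verilog
// A sketch of the ifdef-ed output-delay block Cliff describes.  The macro
// name (GATE_HOLD_FIX) and the signals are hypothetical.  Define the macro
// only for mixed gate+RTL runs; leave it off so pure-RTL regressions keep
// the simulator accelerations Mark describes.
`ifdef GATE_HOLD_FIX
  always @(posedge clk) data_out <= #1 data_out_int;  // covers <1 nsec holds
`else
  always @(posedge clk) data_out <= data_out_int;     // fast RTL regressions
`endif
```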


> Furthermore, if you eliminate all #1 delays on flops and then have a
> race condition, it is notoriously difficult to debug.  Just turning on
> waveform dumping may cause the simulation to pass...

Race conditions can be avoided by following some simple guidelines.  See my
updated paper in ESNUG 409 #10.

The idea of adding #1 delays to see the races is the wrong emphasis.  Avoid
the races, period.  The recommended guidelines in the pending IEEE Verilog
Synthesis Interoperability Standard were included to help engineers avoid
the race conditions.

Yes, adding #1 can be useful for waveform displays, as noted.  It does make
it easier to see the clk-to-q output delays as opposed to viewing the clock 
edge and transition at the same time.  Even this is not bad if you avoid 
mixing blocking assignments and nonblocking assignments in the same 
procedural blocks, which would cause some of the waveform inputs to also 
transition on the clock edge.  Most confusing (and should be avoided).

Under no circumstances should delays be added to either the left or right 
side of blocking assignments in an RTL model.  This can be the source of 
missed events and simulations that do not behave like any real hardware in 
existence.


> If you don't use PLI, and have bug-free RTL and all of it follows the
> given style recommendations, the #1 is not needed.  However, using #1
> does give your design some degree of resiliency to the many coding
> styles in use today.  I do not dispute that VCS may run faster, but it
> doesn't help me to have a fast simulation that doesn't work.  :)

I typically find that adding the #1 to all nonblocking assignments is 
rooted in misunderstanding of the Verilog scheduling queue, paranoia, and
poor adherence to proven guidelines for avoiding Verilog race conditions.
This is my biggest complaint about most Verilog training courses: they do
not cover nonblocking assignments and guidelines well, and so course
graduates are now among the semi-educated dangerous.

I will gladly enforce a few simple guidelines to get race-free simulations 
that run "vcs +rad" fast.  Thanks, Mark, for sharing the "under the covers" 
info about VCS.

    - Cliff Cummings
      Sunburst Design, Inc.                      Beaverton, OR


============================================================================
 Trying to figure out a Synopsys bug?  Want to hear how 16,683 other users
  dealt with it?  Then join the E-Mail Synopsys Users Group (ESNUG)!
 
     !!!     "It's not a BUG,               jcooley@TheWorld.com
    /o o\  /  it's a FEATURE!"                 (508) 429-4357
   (  >  )
    \ - /     - John Cooley, EDA & ASIC Design Consultant in Synopsys,
    _] [_         Verilog, VHDL and numerous Design Methodologies.

    Holliston Poor Farm, P.O. Box 6222, Holliston, MA  01746-6222
  Legal Disclaimer: "As always, anything said here is only opinion."
 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com



