Q: How does an Avanti engineer get timing closure?

  A: He asks for probation.

      - a joke I heard at the Las Vegas DAC Cadence party last week


( ESNUG 375 Subjects ) ------------------------------------------ [06/28/01]

Item  1 :  How Can I Extract A Hierarchical Module From A Flat Gate Netlist?
Item  2 : ( ESNUG 374 #6 ) Avanti Has LEF/DEF; They Just Won't Tell You It
Item  3 :  I Already Have Raphael; Should I Switch To Raphael NES (Quickcap)?
Item  4 :  The Synopsys Formality Folks Supported Us When Verplex Ignored Us
Item  5 : ( ESNUG 374 #1 ) "Jealous" Analysts Made Those Anti-Magma Comments
Item  6 :  PrimeTime Reports "Ignored Exceptions"; Just Not WHY They Occurred
Item  7 : ( ESNUG 374 #4 ) What's The User Dirt On Electromigration Tools?
Item  8 :  Free Sun GridWare Vs. Paying For Platform Computing's LSF ???
Item  9 : ( ESNUG 372 #10 ) Verplex Vs. Formality "Data" Very Misleading
Item 10 :  Are You Thinking Of Going Avanti Astro Or Sticking With Apollo II?
Item 11 :  Linux Is Lousy For Servers & Get2chip's Linux vs. Sun Benchmarks
Item 12 : ( ESNUG 374 #3 ) CynLib C Benchmarks 2X To 5X Faster Than SystemC
Item 13 : ( ESNUG 374 #3 ) The Emperor Strikes Back On His SystemC Benchmark
Item 14 : ( ESNUG 373 #6 ) DC 99.10-6 Putting 2 Inverters In My Reset Paths!

 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com

( ESNUG 375 Item 1 ) -------------------------------------------- [06/28/01]

From: Jeff Winston <jeff.winston@mindspeed.com>
Subject: How Can I Extract A Hierarchical Module From A Flat Gate Netlist?

Hi, John,

When we do minor (typically metal-only) re-spins of our devices, we make
our changes by hand at the gate level.  We've gotten good at specifying
the gate changes, but have run into a roadblock in verifying them.  In the
old days, our gate level netlists preserved hierarchy, and so it was easy
to extract the module containing the fix and either use it to replace the
RTL block in simulation, or (more recently) re-synth the RTL and
Formality-check the synth'd gates to the hand-fixed gates.  However, things
like clock-tree-insertion and scan-reordering are now causing our gate-level
netlists to be totally flat.  Extracting a block of gates that's
bristle-equivalent to an RTL block has become a non-trivial task.  I was
wondering if anyone out there had licked this problem?

    - Jeff Winston
      Mindspeed Technologies


( ESNUG 375 Item 2 ) -------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #6 ) Avanti Has LEF/DEF; They Just Won't Tell You It

> Once I came to my present employer, I couldn't remember the "getClash"
> command.  I knew it existed but had lost my notes.  So I asked my AE at
> Avanti to help me out.  His response was that I had to ask the salesperson
> about anything regarding LEF/DEF.  So I asked him, and his most helpful
> response was:
> 
>    "It's good to hear from you again.  Unfortunately, we do not support
>     LEF/DEF.  We only support Verilog and GDS II.  Are you using Apollo?
>     Have you evaluated Saturn (our physical optimization)?"
> 
> Yeah... I've seen Saturn in action.  Or is that "inaction"?  ...
>
> Apollo used to output DEF based upon the CEL view, not the FRAM view.  No
> clue why, but I have seen CEL views that didn't match the FRAM and the DEF
> was, well, pretty useless.  This switch changes the behavior so auDefOut
> uses FRAM for coordinate calculations:
> 
>    auSetDefOutFramBndry #t
> 
> It was probably mentioned in ESNUG before, but there it is just the same.
> 
>     - Leo Butler
>       Brocade Communications Systems, Inc.


From: [ Intel Inside ]

John,

Please keep me "anonymous" if you want to use my feedback:

Avanti does have a DEF/LEF (auDefIn/auDefOut) interface, but not a complete
one.  It should work in most cases (some of its non-standard parsing gives
extra workaround freedom, while other cases need extra Scheme code to work
around).  It's not as complete as SNPS's and Cadence's, but it's still
working for us so far.  And the documentation for this function is poor or
missing.
    
I heard their new DEF/LEF API is out there already; the new one should be
better than the one I use.  They did fix some DEF/LEF input bugs for us, but
may be putting more effort into their new interface, which I have not used
yet.

I still like to use text-based design files besides any efficient binary
database so that I can always debug and work around problems.  I wish there
were a better and more popular file format for hierarchical & timing-driven
design than today's DEF/LEF.  

    - [ Intel Inside ]
 

( ESNUG 375 Item 3 ) -------------------------------------------- [06/28/01]

From: Grant Riley <grantril@us.ibm.com>
Subject: I Already Have Raphael; Should I Switch To Raphael NES (Quickcap)?

Hello, John,

I am a design kit developer for the IBM SIGE RF/Analog design kits.

I am interested in purchasing Raphael NES (Quickcap).  My group currently
has one copy of Raphael.  I know that Quickcap is most often used as a
'golden' capacitance extraction tool when comparing parasitic extraction
tool capacitance accuracy.  I currently use Raphael as our 'golden' 3D
extractor, but am thinking of replacing it with Quickcap.  I believe there
are many reasons for this switch.  Here are just a couple:

  - Quickcap offers full chip extraction for critical nets.

  - Quickcap is a full chip extraction tool.

  - Raphael works well for test structures, but for large structures the
    run times are not reasonable.

  - Raphael is more accurate than Quickcap for capacitance but the run
    times are long.

So the question is, is it worth switching to Quickcap knowing that Quickcap
is $78 K and I already have Raphael in house?

    - Grant Riley
      IBM


( ESNUG 375 Item 4 ) -------------------------------------------- [06/28/01]

From: Dharina Desai <dharina_desai@fast-chip.com>
Subject: The Synopsys Formality Folks Supported Us When Verplex Ignored Us

Hi, John,

We purchased Formality after evaluating Verplex's Tuxedo and Synopsys
Formality.  Throughout the evaluation, Formality completed all of the test
cases using a simple standard setup and the Synopsys sales-support team was
always quick to respond and constantly offered their assistance.  Verplex
showed no such alacrity and was not able to complete any of our test cases.

Synopsys's level of support during the evaluation led us to conclude that
they would remain committed to our success even after our purchase.  I'm
glad to say, they have!

During our most recent verification, Formality wasn't able to match our RTL
code with our Ambit gate-level netlist.  The RTL looked good and we passed
it to Synopsys as a bug.  After several tests, they were not able to find
any error generated by Formality, and netlists which were synthesized from
DC passed verification.  We began to look at Ambit and found a bug in the
way it treated a 2-stage adder!

    - Dharina Desai
      Fast-Chip, Inc.


( ESNUG 375 Item 5 ) -------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #1 ) "Jealous" Analysts Made Those Anti-Magma Comments

> I'd echo what you've already heard about Magma.  I have also heard from
> all the major EDA players that Magma was shopped heavily prior to this
> filing and that all the majors turned them down.  Two reasons cited were
> failure to penetrate the market significantly in the past two years and
> a product that had trouble managing large designs.
>
> When I look at Magma, I think it looks like the banks are trying to turn
> EDA into the next Internet bubble.  Early in the Internet craze, banks
> avoided bringing public companies with less than $18-$20 million in
> trailing twelve month sales.  As the gold rush continued the standard
> got lower and lower.  Many later deals were of lesser quality and had
> fewer sales.
>
> With $8.4 million in sales and this kind of burn rate, I'd say the
> investors who buy this thing after the large funds flip risk getting
> burned themselves.
>
>    - Jennifer Jordan
>      Sr. Research Analyst
>      Wells Fargo Van Kasper                     Portland, OR


From: Dave Chapman <dave@goldmountain.com>

Hey John,

I think that Jennifer Jordan is mistaken about a venture rush into EDA.  Six
months ago, the Sand Hill Bandits were saying that they had no interest in
doing another EDA company.  The stated reason was that, since the total EDA
market is under $20 Billion, there was not sufficient upside potential.

Given the realities of what marketing costs, I have to agree.

So, the smart money is out of this investment space.  I don't know about the
dumb money.

    - Dave Chapman
      Gold Mountain Consulting

         ----    ----    ----    ----    ----    ----   ----

From: Ravi Nair <ravi@6d.com>

John,

Chanced to see your Magma filings.  Looks like you are out to get them.  :-)

  1) Thought that the front page lead which says

       "There's not only all the tape-out data there, but also lots of
        detailed Magma bug talk and what it's like to really use Blast
        Fusion",

     was kind of misleading.  If you actually go read user responses, they
     were quite positive.  And most of the bugs were supposedly prior bugs
     which the user says were fixed in subsequent versions.

  2) The flurry of analyst comments got me thinking on what was in it for
     them to come out negatively against Magma.  All I had to do was to go
     to the CNET brokerage site

             http://investor.cnet.com/investor/brokeragecenter
          /reports-single-company/0-9910-1084-0-SNPS.html?tag=qbox

     Of course, all four of them (Merrill, WF Van Kasper, Goldman and Dain
     Rauscher) turned out to be Synopsys boosters who have put out buy
     recommendations on Synopsys since May.  All of them, like Van Kasper,
     probably make a market in Synopsys shares.  What a biased bunch!

     The only brokerage firm missing from the list of those who made
     recent Synopsys buy recommendations was Robertson Stephens.  But of
     course, they are underwriters for the Magma IPO!!!

           http://www.hoovers.com/co/capsule/6/0,2163,99156,00.html

So how does one take these analysts' word at face value?  Looks like they
are trying to protect their hide and their clients' money.  And maybe sore
that they didn't get into the IPO action?

Got to admit though that you have tried to be fair trying to paint a true
picture of current Magma tapeouts, when you could have stuck by your earlier
rules and limited the number.

Ease up on these guys, will ya?  It takes a lot of money and effort to go up
against entrenched market players.  Also, it looks like they are getting
away from so called "internet era excesses", by trimming staff and trying to
get profitable.  And you got to give it to Magma for trying to tilt at the
windmills rather than going the timid route of other EDA start-ups who
restrict themselves to a niche with an exit strategy of being gobbled up by
the big three.  It at least makes the market leaders take notice and try to
improve their products - and ultimately, it is the users who benefit
tremendously from this (like Intel-AMD-Transmeta).

    - Ravi Nair
      Sixth Dimension, Inc.                      Fremont, CA

         ----    ----    ----    ----    ----    ----   ----

From: [ A Magma Employee ]

John,

Did you notice that all the negative comments from analysts about Magma were
from those who didn't get Magma business?  It sure is a case of sour grapes.

Keep me anonymous.

    - [ A Magma Employee ]


( ESNUG 375 Item 6 ) -------------------------------------------- [06/28/01]

From: [ Feeling Exceptionally Ignored ]
Subject: PrimeTime Reports "Ignored Exceptions"; Just Not WHY They Occurred

John, please keep me anon.

PrimeTime reports "ignored exceptions" yet doesn't state -WHY- the exception
was ignored.  Nor is there any capability to help debug the cause for the
ignored exception.  I'm painfully aware of all the reasons why the exception
*may* be ignored, but what would be really nice would be for PrimeTime to
give me the answer so when I push the button on a new customer's design I
won't be faced with days of work to analyze reams of ignored exceptions.

I've requested this enhancement from Synopsys yet there has been no action.
Anyone out there have a home grown solution?  Does Ambit also have this
shortcoming?

    - [ Feeling Exceptionally Ignored ]


( ESNUG 375 Item 7 ) -------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #4 ) What's The User Dirt On Electromigration Tools?

> Cadence has such a signal wire EM tool, and has had one for quite some
> time.  The Cadence tool SE-SI (Silicon Ensemble - Signal Integrity) has
> handled the signal line electromigration problem for several years.  We
> call this effect 'wire self heat', but it's the same effect.  It's also
> sometimes called 'Joule heating'. ...
>
>     - Lou Scheffer
>       Cadence


From: Caesar Abedin <cabedin@amcc.com>

John,

OK, now that I've heard from every EDA vendor out there, some trying to sell
us their place and route tool and others trying to sell their extraction
tool, it seems that there are 3 major vendors out there that have a
stand-alone tool for EM analysis on signal nets - ElectronStorm (Simplex),
Railmill (Synopsys), and Mars (Avanti).  I was wondering if the masses of
users out there have any recommendations/warnings on them.

    - Caesar M. Abedin
      Applied Micro Circuits Corp.               Andover, MA

         ----    ----    ----    ----    ----    ----   ----

From: Michael Zaslavsky <michael.zaslavsky@intel.com>

Hi John,

Avanti also has signal net analysis, however not in Star-RC-XT.  I hope
they will have it in Star-RC-XT, but for the time being I have to use
Star-RC/Star-Power for that. 

Before explaining the flow, let me say that power/ground EM verification
is also done here by combining Star-RC/Star-Power and Star-RC-XT.  This
combination allows me to do hierarchical analysis at block and chip level.
Most of my APR blocks have embedded custom blocks, which I do not extract
into, to avoid capacity problems w/ transistor-level extraction/verification.
Instead, I am using macromodels generated by Star-Power during transistor-
level analysis.  In the coming weeks I'll set up Star-RC-XT at transistor
level as well.

Regarding signal net verification, there is also an EM part to it, not only
Self-Heating.  The fact is that even in APR designs I have C > Cmax, so I
cannot guarantee that my std cells are EM proven.  As for custom-made blocks,
the whole interconnect metallization should be verified wrt EM.

Self-Heating analysis is a combination of local interconnect heating
analysis and heat dissipation.  The first is pretty much about RMS current
density, which is calculated by Star-Power only in dynamic mode.  So, I
have a flow that runs static analysis first to get approximate values for
Irms, and then filters out the false violations with some heat-dissipation
analysis.  The dynamic mode is still useful when you are left with a
limited number of nets.  Star-Sim is much faster than HSpice, but not as
fast as static analysis.

    - Michael Zaslavsky
      Intel


( ESNUG 375 Item 8 ) -------------------------------------------- [06/28/01]

From: [ Grid And Bear It? ]
Subject: Free Sun GridWare Vs. Paying For Platform Computing's LSF ???

John, please keep me anonymous.

Does anyone out in ESNUGland have experience with using Sun's Grid Engine
software (http://www.sun.com/gridware) for managing EDA compute jobs and 
licenses?  I used Platform Computing's Load Sharing Facility (LSF) at a
previous employer and liked it, but my penny-pinching current employer
likes the cost of GridWare: it's free.  I'm not fond of the GridWare
approach to license management and of course it only runs on Sun and Linux
platforms so far.  What are other users' experiences with it?

    - [ Grid And Bear It? ]

    
( ESNUG 375 Item 9 ) -------------------------------------------- [06/28/01]

Subject: ( ESNUG 372 #10 ) Verplex Vs. Formality "Data" Very Misleading

> With respect to the gentleman from ST's comments about Formality: here's
> some live data from Verplex-LEC for a just completed design. 
>
> Design: about 44K gates with gated clocks and scan.  (There were 2917
> comparison points.)
>
>   a) Gate-to-gate comparisons run in approx 7 mins on an Ultra II-300MHz
>      with 512MB.
>
>   b) Gate-to-RTL runs in approx 57 mins on an Ultra II-440MHz with 512MB.
>      The RTL and gates don't have exactly the same hierarchy either since
>      some blocks get flattened in our synthesis tool.
>
> The above machines run Sun OS 5.6.
>
>     - Tom David
>       Cygnal


From: Umberto Rossi <Umberto.ROSSI@st.com>

Hi John,

When I reported my figure of ~2.5 hours vs inconclusive for 40K gates, I
meant to show the enhancement of Formality 1999.10-FM1.0 vs Formality
2000.05-FM2.0.  For the same versions I reported also the case of ~500K
gates.

Now, 2.5 hours is NOT the typical run time for a 40K gate design, as it
may appear from the comparison between my case and Tom's Verplex numbers.
I have seen a *huge* range of equivalence checking performance over
designs of similar size due to how the design was implemented.  You can
find examples - typically datapath - where execution times span from a
few seconds to inconclusive for nearly the same design sizes!

I have in my regressions a significant number of designs, but providing
an average performance does not really make sense.  I have found cases of
performance similar to Tom's, much worse and much better but, in order to
make an assessment, the comparison should be apples to apples.

I do not think it is really possible to define a set of independent test
cases to be tried on different products as it is for logic simulators.  In
fact in the case of simulators, the final target is to minimize the number
of generated events, which represents a benefit for ANY circuit and features
a threshold for each circuit, whereas in Formal Verification world (not only
Equivalence Checking !) each circuit may feature an appropriate "ordering"
that makes the verification run in a few seconds rather than "infinite".
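
A small illustration of how much "ordering" can matter.  This is a sketch
under an assumption: the paragraph above does not name a data structure, so
the example reads "ordering" as BDD variable ordering, a common case in
equivalence checkers.  The function f = (a1&b1)|(a2&b2)|(a3&b3), the helper
name robdd_nodes(), and the variable indexing are all hypothetical choices
made for this example.  It counts the non-terminal nodes of the reduced
ordered BDD by counting, level by level, the distinct cofactors that still
depend on the next variable in the order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Example function (hypothetical): f = (a1 & b1) | (a2 & b2) | (a3 & b3).
// Variable indices: a1=0, b1=1, a2=2, b2=3, a3=4, b3=5.
bool f(unsigned x) {
    auto bit = [x](int i) { return ((x >> i) & 1u) != 0; };
    return (bit(0) && bit(1)) || (bit(2) && bit(3)) || (bit(4) && bit(5));
}

// Count non-terminal ROBDD nodes of f for a given variable order.
// At level k, each distinct cofactor (first k variables fixed) that still
// depends on the k-th variable of the order corresponds to exactly one node.
std::size_t robdd_nodes(const std::vector<int>& order) {
    const int n = static_cast<int>(order.size());
    assert(n >= 1 && n <= 6);  // a cofactor truth table must fit in 64 bits
    std::size_t total = 0;
    for (int k = 0; k < n; ++k) {
        const int rest = n - k;  // variables not yet assigned
        std::set<std::uint64_t> distinct;
        for (unsigned a = 0; a < (1u << k); ++a) {
            std::uint64_t table = 0;  // truth table of this cofactor
            bool depends = false;     // does it depend on order[k]?
            for (unsigned r = 0; r < (1u << rest); ++r) {
                unsigned x = 0;
                for (int i = 0; i < k; ++i)
                    if ((a >> i) & 1u) x |= 1u << order[i];
                for (int j = 0; j < rest; ++j)
                    if ((r >> j) & 1u) x |= 1u << order[k + j];
                const bool v = f(x);
                if (v) table |= std::uint64_t{1} << r;
                // Flipping order[k] changes the value: real dependence.
                if (v != f(x ^ (1u << order[k]))) depends = true;
            }
            if (depends) distinct.insert(table);
        }
        total += distinct.size();
    }
    return total;
}
```

With the pairs kept together (a1 b1 a2 b2 a3 b3) the count is 6 nodes; with
all the a's ahead of all the b's (a1 a2 a3 b1 b2 b3) it is 14.  The gap
grows exponentially with the number of pairs, which is one way the same
circuit size can land anywhere between "a few seconds" and "infinite".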

Hope this helps,

    - Umberto Rossi
      STMicroelectronics                         Agrate Brianza, Italy


( ESNUG 375 Item 10 ) ------------------------------------------- [06/28/01]

From: [ Should I Or Shouldn't I? ]
Subject: Are You Thinking Of Going Avanti Astro Or Sticking With Apollo II?

John,

Please keep me anonymous on this.  Thanks...

Avanti is very close to a production release of their next generation place
and route system (Astro) which stands to replace Apollo.  Avanti has not
been able to articulate any details to us about how Astro will benefit their
customers by way of real run-times, capacity, or quality of results other
than sales fluff.  They have stated that there are evaluations of Astro
going on presently in their customer base.  I wonder what results are
turning up out there?  Are people planning to move onto this new system?

    - [ Should I Or Shouldn't I? ]


( ESNUG 375 Item 11 ) ------------------------------------------- [06/28/01]

Subject: Linux Is Lousy For Servers & Get2chip's Linux vs. Sun Benchmarks

> We're a small company using & abusing a couple of HSPICE licenses on a
> few Sun Solaris platforms which are showing their age.
>
> We'd like to figure out the most cost-effective way of upgrading our
> HSPICE capabilities, both for the projects where we do in-house HSPICE
> simulations and where HSPICE is used "under the covers" (cell
> characterization software, for instance).
>
> I've heard lots of anecdotes about how a cheap x86 cluster (with "modern"
> processors) can provide more bang-for-the-buck than the corresponding Sun
> Solaris workstation setup.  This is especially because of the
> less-expensive HSPICE licenses for x86 boxen & OSes, and for our situation
> where we are only doing HSPICE simulation of small cell-level transistor
> netlists (instead of large blocks) so we don't need massive memory and/or
> memory bandwidth.
>
> I was hoping that either you or one of your readers can point me to some
> specs/benchmark results/examples that I could show to some of my coworkers
> for our discussions.
>
>     - Kim Flowers (Mr.)
>       Translogic Technology, Inc.


From: Adel Khouja <adel@get2chip.com>

Hi John,

I follow with interest the recent Sun/HP/Linux thread on ESNUG.  From the
beginning, we (Get2Chip) decided to use Linux for 100% of our development.
Market realities forced us to port all of our releases to Sun/Solaris, and
we expect that our next release will also be ported to HPUX.

Here are the details of our Solaris vs. Linux experience.  

                              Get2Chip Volare Synthesis
                         ------------------+--------------------
                                 Sun       |     Linux
                         ------------------+--------------------
  design1 (120K gate-eq)      100 min      |      45 min
  design2 (250K gate-eq)      210 min      |     100 min


What we used:

  SUN

  SUNW, Ultra-60;sparc; sun4u
  Physical Memory(RAM): 2048 Megabytes
  Virtual Memory(Swap): 2049 Megabytes
  Operating System: SunOS Release 5.7 Generic_106541-11

  OS Release is Solaris 7 8/99 s998s_u3wos_11 SPARC
  System is a Sun Ultra 60 UPA/PCI (2 X UltraSPARC-II 450MHz) 2.0 GB RAM
  Sun Microsystems Inc.   SunOS 5.7       Generic October 1998


  LINUX

  OS Release is Red Hat Linux release 6.2 (Zoot)
  System is a AuthenticAMD 1000 Mhz AMD Athlon(tm) Processor 960 MB RAM

  more /proc/cpuinfo
  processor       : 0
  vendor_id       : AuthenticAMD
  cpu family      : 6
  model           : 2
  model name      : AMD Athlon(tm) Processor
  stepping        : 2
  cpu MHz         : 1000.069690
  cache size      : 512 KB
  fdiv_bug        : no
  hlt_bug         : no
  sep_bug         : no
  f00f_bug        : no
  coma_bug        : no
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 1
  wp              : yes
  flags           : fpu vme de pse tsc msr 6 mce cx8 sep mtrr pge 14 cmov
                    fcmov 17 22 mmx 24 30 3dnow
  bogomips        : 996.15


  more /proc/meminfo

            total:     used:      free:   shared:  buffers:   cached:
  Mem:  2124222464 411144192 1713078272  16814080  45359104  97726464
  Swap:  838934528  25690112  813244416

  MemTotal:   2074436 kB
  MemFree:    1672928 kB
  MemShared:    16420 kB
  Buffers:      44296 kB
  Cached:       95436 kB
  BigTotal:   1114044 kB
  BigFree:     873324 kB
  SwapTotal:   819272 kB
  SwapFree:    794184 kB


I concur with one of your readers who said they had good experiences with
Dell hardware.  A Dell 2- or 4-way with 4 G of memory is a honking machine.
And even with Linux's current 2.8G memory limit, that's still big enough for
us to synthesize and optimize a 2M gate-equivalent design in 10-15 hours.
Pretty good for any class machine.


Using Linux as a file server is another story...

At the time when we were running Linux as our main file server, we basically
had one choice for a file system type: ext2.  Although ext2 is fairly fast,
it did not have a journaling system nor quality file system tools.  If we
had to reset the machine, the system would take a painfully long time to
check the entire file system.

Secondly, with Linux, we ran into many headaches with driver and network
support.  Linux has a reputation for supporting cheaper hardware well and
more exotic high end equipment, poorly.  With devices like a fast Mylex
RAID card, we had difficulties getting initial drivers and then finding
updates to known problems.  As well, not all drivers were developed equally
as was the case for 3COM network cards (one of the biggest culprits).

To sum it up, we had a combination of hardware driver problems and
Linux/file system problems, the worst possible situation.  It should
be noted that we NEVER lost any data during our ordeal.

The Linux file server specs:

     2x 550Mhz PIII
     512M RAM
     2x Intel Pro100b+
     Mylex Extreme RAID 2000 with 128M RAM
     Seagate 10K Cheetah Drives

We have since migrated off our main Linux file-server to a Network
Appliance 760.  This unit has been rock solid and has returned on our
investment several times over by offering superior performance (raw
throughput), greater reliability, and unsurpassed technical support.


Suffice to say that we're very happy with Linux and plan to continue our
development/porting strategy.  Lintel boxes give us 200+% runtime
improvements over Sun/HP.  Yes, Lintel boxes are sometimes less stable,
but we've found that is usually due to flakey h/w rather than sw/OS issues.
When we've had problems with Linux, it's generally been because of H/W.

    - Adel Khouja, R&D
      get2chip.com

         ----    ----    ----    ----    ----    ----   ----

From: [ Better Late Than Never ]

Hi John,

You're publishing too quickly; I didn't get a chance to respond before
the next ESNUG came out.

We have about 70 linux systems in a server farm.  They're all running
Debian 2.2.  The original machines were do-it-yourself PIII 500 MHz.
The latest are PIII 1+GHZ machines from ARM.  They all have IDE drives
and network cards compatible with the tulip driver.  We were using a
disk image to install, but now we're using VA Linux's SystemImager.  To
upgrade the farm, we upgrade one machine (apt-get update; apt-get -f
dist-upgrade), test it with all our software, and then copy the image
to all the other machines over the network and reboot.  They're all
single processor which fits well with the one job / one machine queueing
software we're using.

Our main simulator is VCS.  We also use VirSim, SureLint, and SureCov.
The system guys are heavy users of matlab on these machines.  256MB is
plenty of memory for most of the designs we've run with these tools,
but maybe it's not enough memory as you move closer to the backend.

We just started trying dc_shell and PrimeTime on linux.  I'll try any of
our tools on linux when they're available.

Please keep me anonymous.

    - [ Better Late Than Never ]


( ESNUG 375 Item 12 ) ------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #3 ) CynLib C Benchmarks 2X To 5X Faster Than SystemC

> Here are the results of our experiments with this benchmark including a
> simulation compiled without compiler optimization:
>
>   ESNUG 373 SystemC code [1] : ############## 142.4 sec
>   ESNUG 373 SystemC code [2] : ###################### 224.9 sec
>   ARM SystemC code [1]       : ### 36.6 sec
>   Verilog equivalent [3]     : ######################## 238.0 sec
>
>   [1] SystemC 1.2; "gcc -g -O3 -march=i686"; 550MHz Pentium III
>   [2] SystemC 1.2; "gcc -g -O0"; 550MHz Pentium III
>   [3] VerilogXL 3.2; "verilog +turbo"; 550MHz Pentium III
>
> This suggests that the ESNUG 373 #2 results were generated using an
> unoptimized compilation.
>
>     - Jon Connell
>       ARM


From: Bernard Deadman <bdeadman@sdvinc.com>

John,

Comparing SystemC with any event driven Verilog simulator is unfair.
SystemC is "cycle accurate", so why not make the comparison with a
cycle-based Verilog simulator, because that's giving you the same level
of precision?  I've never tried it, but my guess is the performance of
SystemC would look pretty sick in that comparison.

Overall, SystemC *is* slow - the way to get real performance is to jettison
a lot of the simulation kernel and get closer to raw C++.  My view is
SystemC and all of the other class libraries are too heavy.  If you want
great simulation performance it's time for a 'lite' version, with a minimal
set of classes, and let the user add in the extra stuff he needs rather
than burden the SystemC core with a *ton* of stuff most people don't
use.  A competent user can add in just the extra bits he actually needs.

    - Bernard Deadman
      SDV, Inc.                                  Austin, TX

         ----    ----    ----    ----    ----    ----   ----

From: John Sanguinetti <jws@forteds.com>

Hi, John,

The SystemC benchmark from [ Emperor ] and Jon Connell's optimized version
beg further analysis, and a comparison with the same code in C++/Cynlib.

The original SystemC code from [ Emperor ] has a glaring inefficiency in it:

        ...
        if (load) {
          sc_bool_vector tmp(9);
          tmp = (sc_bv<9>) count_nxt;
          save_count_out = count_nxt;
          parity_out.write(tmp.xor_reduce());
          ...
        }

The temporary variable tmp is created and destroyed every time through this
block (nearly every clock cycle).  If you make that a class member variable
instead of an automatic, then this code runs about 3.7x faster using
Connell's test bench.
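
To make the automatic-versus-member point concrete without needing SystemC
installed, here is a minimal plain C++ sketch.  It is an assumption-laden
stand-in: std::vector<bool> plays the role of a heap-backed bit vector like
sc_bool_vector, and the struct names ParityAutomatic and ParityMember are
hypothetical.  It shows the structural difference only; it makes no claim
about reproducing the 3.7x figure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Automatic temporary: the 9-bit vector is constructed and destroyed on
// every call, the pattern described above for the original code.
struct ParityAutomatic {
    bool parity(std::uint16_t count_nxt) const {
        assert(count_nxt < 512);           // 9-bit value expected
        std::vector<bool> tmp(9);          // allocated on each call
        for (int i = 0; i < 9; ++i) tmp[i] = ((count_nxt >> i) & 1u) != 0;
        bool p = false;
        for (int i = 0; i < 9; ++i) p = (p != static_cast<bool>(tmp[i]));
        return p;                          // XOR of all 9 bits
    }
};

// Member temporary: the vector is allocated once with the object and only
// overwritten on each call.
struct ParityMember {
    std::vector<bool> tmp = std::vector<bool>(9);
    bool parity(std::uint16_t count_nxt) {
        assert(count_nxt < 512);
        for (int i = 0; i < 9; ++i) tmp[i] = ((count_nxt >> i) & 1u) != 0;
        bool p = false;
        for (int i = 0; i < 9; ++i) p = (p != static_cast<bool>(tmp[i]));
        return p;
    }
};
```

Both versions return the same parity for every 9-bit input; only the
lifetime of the scratch vector differs.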

If we look at what Connell did, you can see he did three optimizations:

  1) As he said, he switched the sc_bv<9> in the original to sc_uint<9>.
     The switch to sc_uint makes about an 18% improvement over the original.

  2) He replaced the call to xor_reduce with his own code to do it.
     This:

          sc_bool_vector tmp(9);
          tmp = (sc_bv<9>) count_nxt;
          parity_out.write(tmp.xor_reduce());

     Was replaced by this:

         for (int ii = 0; ii < 10; ii++) tmp ^= count_nxt[ii];
         parity_out.write(tmp);

     This is the big winner, because by making tmp a member variable of 
     type bool, instead of an automatic of type sc_bool_vector, he 
     eliminated the creation and destruction of an sc_bool_vector on every
     clock cycle.  This makes the code run 4.3x faster than the result of 
     optimization (1).

  3) He replaced

        if (up == 0 && down == 0)
          count_nxt = data_in;
        else if (up == 0 && down == 1)
          count_nxt = cnt_dn;
        else if (up == 1 && down == 0)
          count_nxt = cnt_up;
        else if (up == 1 && down == 1)
          load = 0;

     with

       int mode = (up << 1) | down;
       switch (mode) {
         case 0: count_nxt = data_in.read(); break;
         case 1: count_nxt = cnt_dn; break;
         case 2: count_nxt = cnt_up; break;
         default: load = 0; break;
       }

     This new code looks more like the original Verilog, but it doesn't 
     have much of an effect on the performance.  It only improves the speed
     by about 4% over optimizations (1) & (2).
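
The if-chain and the switch in optimization (3) select the same action for
every up/down combination, which is consistent with the small speed
difference.  The sketch below checks that in plain C++.  Assumptions: plain
integer types stand in for the SystemC signal types, and the names Decoded,
decode_if, decode_switch and the prev parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Decoded {
    std::uint16_t count_nxt;  // next count (left at prev when load is false)
    bool load;                // false means "hold": nothing is written
};

// Original style: a chain of comparisons on up and down.
Decoded decode_if(bool up, bool down, std::uint16_t data_in,
                  std::uint16_t cnt_up, std::uint16_t cnt_dn,
                  std::uint16_t prev) {
    Decoded d{prev, true};
    if (!up && !down)      d.count_nxt = data_in;
    else if (!up && down)  d.count_nxt = cnt_dn;
    else if (up && !down)  d.count_nxt = cnt_up;
    else                   d.load = false;
    return d;
}

// Rewritten style: pack up/down into a 2-bit mode and switch on it.
Decoded decode_switch(bool up, bool down, std::uint16_t data_in,
                      std::uint16_t cnt_up, std::uint16_t cnt_dn,
                      std::uint16_t prev) {
    Decoded d{prev, true};
    const int mode = (static_cast<int>(up) << 1) | static_cast<int>(down);
    assert(mode >= 0 && mode <= 3);
    switch (mode) {
        case 0:  d.count_nxt = data_in; break;
        case 1:  d.count_nxt = cnt_dn;  break;
        case 2:  d.count_nxt = cnt_up;  break;
        default: d.load = false;        break;
    }
    return d;
}
```

Calling both functions with the four up/down combinations and the same
inputs gives identical count_nxt and load values.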

So the results of running the original and the above three optimizations on
my 400 MHz Sun e/450 are:

                 original    opt 1     opt 1+2    opt 1+2+3

      cpu time   522 sec.    442 sec.  103 sec.   99 sec.
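
The percentages quoted for each optimization follow from the cpu times in
this row.  A small sketch of the arithmetic; the function names speedup()
and print_speedups() are hypothetical, and the four times are copied from
the row above.

```cpp
#include <cassert>
#include <cstdio>

// Speedup factor between two run times (before / after).
double speedup(double before_sec, double after_sec) {
    assert(after_sec > 0.0);
    return before_sec / after_sec;
}

// Print the step-by-step and cumulative speedups for the four runs.
void print_speedups() {
    const double t[] = {522.0, 442.0, 103.0, 99.0};
    const char* name[] = {"original", "opt 1", "opt 1+2", "opt 1+2+3"};
    for (int i = 1; i < 4; ++i) {
        std::printf("%-10s step %.2fx  cumulative %.2fx\n", name[i],
                    speedup(t[i - 1], t[i]), speedup(t[0], t[i]));
    }
}
```

The step ratios come out to about 1.18x (the 18%), 4.29x (the 4.3x) and
1.04x (the 4%), for roughly 5.3x overall.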


Now to the interesting part.  Why was that temporary variable there in the
first place?  It was there because you can't apply the xor_reduce() function
to an sc_uint<>, but only to an sc_bool_vector<> (or sc_bv<>), so the
assignment was a convenient way to cast count_nxt to the needed type.  Note
that Jon Connell solved the problem by just not using the built-in
xor_reduce function.  The real culprit here is that SystemC has too many
data types, and not all functions are available for all of them.  If you
write your code in a reasonably natural way, like [ Emperor ] did, you may
very well get unreasonable results.


Now compare the above with the same code written in C++ using Cynlib and the
(free) Cyn++ preprocessor:

  Module up_down (In<1> clk, In<1> up, In<1> down,
                  In<9> data_in, Out<1> parity_out, Out<1> carry_out,
                  Out<1> borrow_out, Out<9> count_out)
    
     Uint<9> count_nxt;
     Uint<10> cnt_up, cnt_dn;
     Uint<1> load;
   
     Always (Posedge(clk))
         cnt_dn = count_out - 5;
         cnt_up = count_out + 3;
         load = 1;
         switch( (up,down) ) {
             case 0: count_nxt = data_in; break;
             case 1: count_nxt = cnt_dn; break;
             case 2: count_nxt = cnt_up; break;
             case 3: load = 0; break;
         }
         if( load ) {
             parity_out <<= CynRedXor(count_nxt);
             carry_out  <<= up&cnt_up(9);
             borrow_out <<= down&cnt_dn(9);
             count_out  <<= count_nxt;
         }
     EndAlways
  EndModule
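
For readers without Cynlib installed, the behaviour that the Cyn++ module
describes can be approximated in plain C++.  This is only a sketch under
stated assumptions: the struct name UpDownModel and its clock() method are
hypothetical, the (up,down) concatenation is assumed to put "up" in the
upper bit as Verilog's {up,down} would, and Cynlib's own width and port
rules are not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-C++ stand-in for the up/down counter; not Cynlib code.
struct UpDownModel {
    std::uint32_t count = 0;   // 9-bit count_out register
    bool parity = false;       // parity_out
    bool carry  = false;       // carry_out
    bool borrow = false;       // borrow_out

    // One rising clock edge.
    void clock(bool up, bool down, std::uint32_t data_in) {
        assert(data_in <= 0x1FFu);                            // 9-bit input
        const std::uint32_t cnt_dn = (count - 5u) & 0x3FFu;   // 10-bit result
        const std::uint32_t cnt_up = (count + 3u) & 0x3FFu;   // 10-bit result
        std::uint32_t count_nxt = 0;
        bool load = true;

        // Assumption: "up" is the upper bit of the 2-bit selector.
        switch ((up ? 2u : 0u) | (down ? 1u : 0u)) {
            case 0:  count_nxt = data_in;         break;
            case 1:  count_nxt = cnt_dn & 0x1FFu; break;
            case 2:  count_nxt = cnt_up & 0x1FFu; break;
            default: load = false;                break;      // hold
        }

        if (load) {
            bool p = false;                        // reduction XOR of 9 bits
            for (int i = 0; i < 9; ++i) {
                p ^= ((count_nxt >> i) & 1u) != 0u;
            }
            parity = p;
            carry  = up   && (((cnt_up >> 9) & 1u) != 0u);
            borrow = down && (((cnt_dn >> 9) & 1u) != 0u);
            count  = count_nxt;
        }
    }
};
```

Starting from a count of 200 and calling clock(true, false, 0) once leaves
count at 203 with parity set and carry clear; asserting both up and down
holds every output.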

The Cyn++ code above is at least as easy to read as the Verilog (or
Superlog) and just as concise.  Running Jon Connell's testbench with this
code took 32 seconds, about 3x faster than the 99 seconds of the fully
optimized SystemC.  I ran some other testbenches to see how SystemC and
Cynlib scale (since that was one of Emperor's complaints), and I got the
following results:

                      SystemC               SystemC
     Instances    (Emperor's code)   (Connell's patched code)     Cynlib

     1 ( 1M cycles)     42 sec                21 sec                4 sec
    10 ( 1M cycles)    231                    56                   16
   100 ( 1M cycles)   2164                   391                  173
  1000 (10K cycles)    218                    43                   17

Note that scaling with size is not really a problem in any of these after
about 10 instances.  You can see that Cynlib runs between 2 and 5 times
faster than the optimized SystemC does on this model.
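
The "2 to 5 times" figure can be rechecked from the table.  The sketch
below only redoes that arithmetic; the run times are copied from the
"Connell's patched code" and "Cynlib" columns, and the names Row, kRows,
speedup, min_speedup and max_speedup are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// One table row: instance count and the two run times in seconds.
struct Row {
    int    instances;
    double patched_systemc_sec;
    double cynlib_sec;
};

// Values copied from the table (1000-instance row ran 10K cycles, not 1M).
const std::array<Row, 4> kRows = {{
    {1,    21.0,   4.0},
    {10,   56.0,  16.0},
    {100,  391.0, 173.0},
    {1000, 43.0,  17.0},
}};

// How many times faster Cynlib ran than the patched SystemC for one row.
double speedup(const Row& r) {
    assert(r.cynlib_sec > 0.0);
    return r.patched_systemc_sec / r.cynlib_sec;
}

// Smallest ratio over all rows.
double min_speedup() {
    double best = speedup(kRows[0]);
    for (const Row& r : kRows) best = std::min(best, speedup(r));
    return best;
}

// Largest ratio over all rows.
double max_speedup() {
    double best = speedup(kRows[0]);
    for (const Row& r : kRows) best = std::max(best, speedup(r));
    return best;
}
```

The four ratios come out at 5.25x, 3.5x, about 2.26x and about 2.53x, so
the range is roughly 2.3x to 5.3x.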

I should also note that Emperor's comments about SystemC taking a long time
to compile are accurate.  The 100 instance version took 2,460 seconds (41
minutes!) to compile at -O3.  Cynlib took 24 seconds.

So the Emperor's C conclusions don't hold up, at least for Cynlib C:

  Speed - Cynlib is 2x-5x faster than SystemC, and from these 
          measurements appears faster than Superlog.

  Reliability - Cyn++ is at least as easy to read as Verilog, and has 
                the advantage of having C expression semantics.

  Quality - I think Emperor meant debuggability here, and he's right, 
            SystemC is truly difficult to debug.  Cynlib has two tools
            to help here, Cyngdb being free for run-time debugging, and
            Cyntax, a Forte product, for code analysis.

  Software - Cynlib is just C++, so as he said, running a hardware 
             model with the code to drive it is simple.

  IP - We don't know what IP providers will write their models in, but
       by far the largest amount of IP available is not hardware models
       in Verilog but algorithms in C.  It's a lot easier to incorporate
       a C algorithm in Cynlib (or SystemC) than in Verilog.

  Freeware - It's hard to justify the price of a commercial HDL
             simulator when you can do the same job, with the same
             effectiveness, for free.  Of course you've got to buy design
             tools, but by using Cynlib and C++ you don't have to pay for
             simulation, allowing you to spend your tool budget on better
             productivity tools.

Finally, Wilson Snyder's comments about using SystemC to code at an RTL
level are on the mark.  It just is not very easy, and there really isn't
much point.  Cynlib is a lot easier, due in part to Cyn++ but also due to
the more rational data types and port semantics.  However, where both
SystemC and Cynlib are in their element is when elaborating a hardware
implementation from a C algorithm.  If you come from the top down, using
C++ for hardware description is quite reasonable, particularly if you use
Cynlib.

    - John Sanguinetti, CTO
      Forte Design Systems

         ----    ----    ----    ----    ----    ----   ----

> I am sure over time the SystemC kernel will vastly improve.  They haven't
> optimized the pin interconnect.  They aren't optimizing between modules.
> They aren't inlining modules.
>
>     - Wilson Snyder


From: "Janick Bergeron" <janick@qualis.com>

Hi, John,

Although that would be a great benefit, I'm wondering what would be the
motivation for *anyone* to work on improving the SystemC kernel?  While EDA
companies can hope to recoup their investment in improving the performance
of their Verilog/VHDL simulators, there is no such financial benefit in
improving SystemC's performance.  Quite the contrary: you'd be cannibalizing
your own VCS simulator sales.

It's not going to come from corporate users either.  They are paid to
engineer products that will generate revenue for their employers.  Any
improvement to a free/opensource tool is going to remain proprietary to a
company as a competitive advantage.

Such is the burden of free/opensource tools.

The only way I currently see SystemC's performance being improved is through
a model similar to Linux: self-motivated hackers, working on their own time,
who would get a kick to have their modification included in the "official"
release.  SystemC doesn't have that going for it, but I do expect to see a
lot of "Improving SystemC" papers from students in the near future.

    - Janick Bergeron
      Qualis Design


( ESNUG 375 Item 13 ) ------------------------------------------- [06/28/01]

Subject: ( ESNUG 374 #3 ) The Emperor Strikes Back On His SystemC Benchmark

> The anonymous author of the "Embarassment" notes that SystemC is 4.5 times
> slower than SuperLog.  SystemC is good at writing behavioral code; bad at
> writing Verilog code.  If you're switching to SystemC as a replacement for
> writing Verilog, you're missing the point, and will be disappointed.
>
> As to the specific example, I made up my own little test bench, since none
> was provided.  I suspect the author didn't compile with optimization of
> the SystemC library.  Furthermore, bit vectors in SystemC are MUCH slower
> than integers.  Either his example should use sc_bit and sc_bv, or bool
> and int. Since the author used bool, I think it's fair to use int's also.
>
>     - Wilson Snyder


From: [ The Emperor Has No Clothes ]

Hello, John.

Anonymous as usual.

My SuperLog implementation of the up/down counter used bit vectors.  These
are arbitrarily long 2-state vectors.  The SystemC sc_bv is also an
arbitrarily long 2-state vector, so I chose that in preference to sc_uint as
a balanced comparison, since the SystemC sc_uint type is limited to 64 bits.
Furthermore the example performs bit manipulation that was easier using
SystemC sc_bv types than sc_uint types.

The examples that I originally made did contain a lot of redundant code;
both for the SuperLog and the SystemC versions.  Each counter instance had
its own stimuli instance and I also had an additional cycle counter in the
SuperLog example, plus an extra layer of hierarchy.  The stimuli instances
had a reset phase to preload the upDown counters in cycle 1 with a value of
200.

I have however analysed Connell's SystemC code in ESNUG 374 #3.  I increased
the upDown instances from 4 to 10 and added a printf to print the results
after the 5 Million clock cycles.  I have also updated my SuperLog example
to reflect the simplified testshell hierarchy in that new SystemC example.
The SystemC is now using sc_uint, no longer an abstract length bit vector,
so this is no longer a like-with-like comparison.

The SystemC is also compiled, whereas the SuperLog was interpreted, so the
compile time of the example becomes significant.  Optimisation was -O3.


            SystemC compile              11.0 sec
            SystemC 5M run               71.2 sec
                                        ---------
            SystemC total                82.2 sec

            SuperLog analyse & 5M run    54.2 sec


So even in an unfair comparison, interpreted SuperLog is still faster than
SystemC.  I cannot evaluate how fast compiled SuperLog would be.  Perhaps
the professionals at Co-Design, Inc. could eke out more performance from
this example.

The interesting thing is the results generated by SystemC for the 10
instances after the 5 Million clocks:

     448 1 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0
     448 0 0 0

The first instance parity output is correct, the others are wrong.

In SuperLog the bit vector allows bit manipulation

   bit[8:0] count_nxt;
    ...
    parity_out <= ^count_nxt;

In the submitted SystemC using sc_uint this needs to be replaced by a rather
more complicated iterated assignment to an intermediate temporary variable:

    sc_uint<9> count_nxt;
    bool  tmp;
      ...
        for (int ii = 0; ii < 10; ii++) tmp ^= count_nxt[ii];
        parity_out.write(tmp);

For the correct result, 2 fixes are required, tmp initialised and iterator
limited to actual width:

    sc_uint<9> count_nxt;
    bool tmp;
      ...
        tmp = 0;
        for (int ii = 0; ii < 9; ii++) tmp ^= count_nxt[ii];
        parity_out.write(tmp);
          // worse could happen in VHD-Hell
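
The same two fixes can be shown without any SystemC headers.  This is a
minimal sketch, assuming the 9-bit count is held in an ordinary unsigned
integer; xor_reduce_bits is a hypothetical helper name, not a SystemC or
Cynlib function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: XOR together the low `width` bits of `value`,
// which is what the unary ^ reduction does on a `width`-bit vector.
bool xor_reduce_bits(std::uint32_t value, int width) {
    assert(width > 0 && width <= 32);
    bool parity = false;                  // fix 1: start from a known value
    for (int i = 0; i < width; ++i) {     // fix 2: stop at the actual width
        parity ^= ((value >> i) & 1u) != 0u;
    }
    return parity;
}
```

For the final count of 448 (binary 111000000, three bits set),
xor_reduce_bits(448, 9) returns true, which agrees with the parity of 1
reported for the first instance.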


So as well as being embarrassingly slower than interpreted SuperLog, SystemC
is also embarrassingly more error prone.


The SystemC sc_uint type does make the simulation faster, but then the
system implementor has had to make system design choices based on the
implementation of the simulator tool.  These choices are then propagated
through the design hierarchy into the test bench - so scalable generic
design is not possible.

I am not a SuperLog expert.  I am certainly no expert on SystemC.  But if
efficient system design requires mucking around with the innards of the
simulator then the future of system design looks bleak.

For simulation speed of RTL, compiled simulators are certainly faster.  The
NC Verilog simulator is 3.1x faster than the SystemC, without the global
optimisation to 2-state values that was applied to VCS.

My updated SystemC & SuperLog examples, based upon the Jon Connell example,
extended to 10 instances, are enclosed.  A Verilog testbench as well.

    - [ The Emperor Has No Clothes ]


 Editor's Note: These files are in the DeepChip.com "Downloads".  - John


( ESNUG 375 Item 14 ) ------------------------------------------- [06/28/01]

Subject: ( ESNUG 373 #6 ) DC 99.10-6 Putting 2 Inverters In My Reset Paths!

> I am using DC 99.10-6.  We are doing a million gate bottom up synthesis.
> We are planning to use CTS both for the reset and clock networks.  We put
> following constraints on clk and reset:
>
>    set_drive 0 clock_grp
>    set_dont_touch_network clock_grp
>
>    set_drive 0 rst_l
>    set_dont_touch_network rst_l
>
> But still DC puts 2 inverters in the reset path.  Also there is no logic
> between reset and flop in my design.  Is there any issue with this version
> of DC?  (Interestingly it doesn't put any buffers/inverters on the clock
> network path though its treated exactly like reset.) 
>
>     - Rajendra Marulkar
>       Marconi Communications


From: Siegfried Weidelich <Siegfried.Weidelich@McDATA.com>

Hi, John,

I had the same problem w/ my synchronous resets (driven by a clock buffer).
DC does not treat resets and clocks the same.  Clocks are ideal, so it knows
not to buffer them up no matter what.  Adding this command:

                   set_ideal_net "reset_net_name"

worked for me in DC 1999.10, but I have heard of cases where it doesn't
(depends on DC version, and which design rules are violated).

DC 2000.11+ versions have a new command (clean_buffer_tree "driver.pin") to
remove buffers from a net across hierarchy (I assume to get around this DC
"feature"), but I haven't tried it. 

    - Siegfried Weidelich
      McDATA Corporation

         ----    ----    ----    ----    ----    ----   ----

From: JEAN-MARC.CALVEZ@st.com

Hey, a question I may have an answer for!

I have already seen a similar behaviour.  In my case, here is what happened:
all STMicro flip-flops with async reset that I know of/use have an active
low reset; however some designers, believing that the One True Standard is
the active high signal, code their RTL accordingly.  Post-elaboration, one
will get a reset signal connected to the reset pin of a generic register
(Synopsys's seqgen); post-mapping, the generic register will have been
replaced by a real one and an inverter will be inserted in the reset path
to preserve functionality.

If there is a dont_touch_network set on the reset, DC will pick whatever
inverter is available in its library and, once the initial mapping is done,
will leave it at that, no matter how many DRC violations it causes (after
all, there is a set_dont_touch_network on the reset, which implies that
there will be a dont_touch attribute set on whatever combinational cell
is on the reset path).

Now, consider that this design is actually a sub-block of a larger design,
where the active-low reset convention is adhered to: when the 1st design is
instantiated, the reset port will be connected to the opposite of the
top-level reset: post-elaboration, the gtech inverter will be inserted;
post-mapping, that inverter will be mapped on whatever inverter is available
in the library, and the comment above on set_dont_touch_network will apply:
hence, 2 inverters, poorly chosen, back to back on your reset signal.
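
The polarity bookkeeping behind that explanation can be sketched in a few
lines of C++.  This models only the reasoning in this letter, not Design
Compiler: it assumes one inverter is inserted wherever the reset polarity
changes between adjacent points on the path and that, because of
set_dont_touch_network, none of them is optimised away afterwards.  The
names Polarity and inverters_needed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Polarity { ActiveLow, ActiveHigh };

// Count the polarity changes along a reset path, listed from the
// top-level port down to the flip-flop pin.  Each change needs one
// inverter to preserve functionality.
int inverters_needed(const std::vector<Polarity>& path) {
    assert(!path.empty());
    int count = 0;
    for (std::size_t i = 1; i < path.size(); ++i) {
        if (path[i] != path[i - 1]) {
            ++count;
        }
    }
    return count;
}
```

A path of active-low top level, active-high sub-block port and active-low
flip-flop pin gives two inverters; a path that is active-low throughout
gives none.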

In general, I have always found out the hard way that set_dont_touch_network
had a lot of unwanted side effects, and thus I advocate against its usage, 
unless the designer really knows what she/he is doing.

    - Jean-Marc Calvez
      STMicroelectronics

         ----    ----    ----    ----    ----    ----   ----

From: Ansgar Bambynek <a.bambynek@avm.de>

Hi John,

I had a similar problem with DC inserting inverters into the reset path.  I
contacted the ASIC vendor and Synopsys.  They told me that they filed a STAR
since it should have something to do with the library.  Unfortunately I
haven't heard anything more about it.

The problem arose when synthesizing FFs which have only a low active reset.

Here's the original constraint (different than the one used by Rajendra):

       set_driving_cell -none reset
       set_dont_touch_network reset

What helped was the following instead

       set_drive 0 find (port, reset)
       set_max_fanout 1000    reset

The fanout load on the reset port of my block is much smaller than 1000.  I
used the 1999.10-4 release of DC when facing this problem.

    - Ansgar Bambynek
      AVM Computersysteme GmbH                   Berlin, Germany

         ----    ----    ----    ----    ----    ----   ----

From: Dave Smith <Dave.Smith@st.com>

This appears to be a bug.  I noticed it on one of my designs, and here is
the summary:

I have noticed a problem using set_dont_touch_network on reset trees, on
versions prior to 2000.11:

  If a net has set_dont_touch_network applied, and goes to a number of cells
  without an inverter, but also feeds an inverter somewhere, then all the
  uninverted cells will have two inverters connected to their input, to form
  a buffer, and all of these buffers will be connected together.

  An example of this is as follows:

  Consider a design "top", with two sub-designs "sub_a" and "sub_b".
  "top" and "sub_a" have active-low reset inputs (not_reset), but "sub_b"
  has an active-high reset input (reset).

  Each of the sub-designs have 500 flip-flops, all using the same
  asynchronous reset input.

  set_dont_touch_network is applied to not_reset at the top level, but
  the signal is inverted before going in to "sub_b", in order to preserve
  the function at the hierarchy boundary.

  After synthesis, the reset tree will be as follows:

  not_reset will feed one inverter at the top level, which will connect
  down in to "sub_b". In the "sub_b" block, reset will connect to one
  inverter, which will then connect directly to the reset input of every FF.

  not_reset will also feed in to the "sub_a" block. However, in this case,
  the not_reset input of the sub block will connect to 500 individual
  inverters, and each of these inverters will connect to another inverter,
  and then to a FF reset input. As a result, every FF in "sub_a" will have
  two inverters buffering its reset input.

  This problem takes no account of hierarchical boundary - provided there
  is an inversion somewhere in the dont_touched network, all of the
  non-inverted cells will get 2 inverters added.
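
  To give a feel for the cost, the inverter counts in the example can be
  tallied.  The sketch below only adds up the description above; it is not
  a model of Design Compiler, and the names ResetTreeCount and
  buggy_inverter_count are hypothetical.

```cpp
#include <cassert>

// Inverters on the reset tree, split by where they sit.
struct ResetTreeCount {
    int top;     // inverters at the top level
    int sub_a;   // inverters inside "sub_a" (active-low block)
    int sub_b;   // inverters inside "sub_b" (active-high block)
};

// Tally of the post-synthesis tree as described in the example.
ResetTreeCount buggy_inverter_count(int ffs_per_block) {
    assert(ffs_per_block >= 0);
    ResetTreeCount c;
    c.top   = 1;                   // the inversion feeding "sub_b"
    c.sub_b = 1;                   // one inverter driving every FF in "sub_b"
    c.sub_a = 2 * ffs_per_block;   // an inverter pair in front of each FF
    return c;
}
```

  With 500 flip-flops per block this gives 1,000 inverters inside "sub_a"
  alone, against one at the top level and one inside "sub_b".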

The good news is that this bug is fixed in 2000.11 (but not in 2000.05)

Alternatively, I've had a mail from Synopsys support, suggesting the
'remove_buffer_tree' command.

    - David Smith
      STMicroelectronics                         Bristol, UK

         ----    ----    ----    ----    ----    ----   ----

From: Robert Wiegand <RWiegand@NxtWaveComm.com>

Hi John,

I have had other problems in the past with resets, and I condition them
slightly differently.  Instead of using set_dont_touch_network, I do the
following:

  if (find(port,all_reset_ports)) {
     set_drive      0 find(port,all_reset_ports)
     set_resistance 0 find(net, all_reset_ports)
     set_dont_touch find(net,all_reset_ports)
     set_ideal_net find(net,all_reset_ports)
  }

Where all_reset_ports is a user variable listing all possible port names
for reset.  The construct find(net, all_reset_ports) is valid when the
net names attached to the port match the port name.  I believe this is
the default behavior, but might be affected by:

      write_name_nets_same_as_ports = "true" 

This seems to work for me.

    - Bob Wiegand
      NxtWave Communications                     Langhorne, PA


============================================================================
 Trying to figure out a Synopsys bug?  Want to hear how 11,000+ other users
    dealt with it?  Then join the E-Mail Synopsys Users Group (ESNUG)!
 
       !!!     "It's not a BUG,               jcooley@world.std.com
      /o o\  /  it's a FEATURE!"                 (508) 429-4357
     (  >  )
      \ - /     - John Cooley, EDA & ASIC Design Consultant in Synopsys,
      _] [_         Verilog, VHDL and numerous Design Methodologies.

      Holliston Poor Farm, P.O. Box 6222, Holliston, MA  01746-6222
    Legal Disclaimer: "As always, anything said here is only opinion."
 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com


