( ESNUG 404 Item 16 ) -------------------------------------------- [01/08/03]

From: Tim Lantz <goldfishgoldfish tim of taunetworks wrought calm>
Subject: Our DC, Monterey, Calibre, Simplex, PrimeTime, Formality Tapeouts

Hi, John,

A little over a year ago, we had the opportunity to implement a backend flow
from scratch at Tau Networks.  Here's our Monterey experience.

Our switch fabric chip set consists of two chips implemented in a 0.15u
technology.  The larger one contains more than 3M placeable objects, and the
smaller one more than 1M placeable objects plus RAM.  Both chips have high
speed 3.125 GBd serdes I/O designed as full custom blocks with some fancy
P&R work around them and then placed as macros.  The smaller chip also has
another custom block: a 20 Gbps implementation of the Network Processing
Forum Streaming Interface.

Our trial runs of Dolphin made us believe we could tapeout using Monterey's
tool set.  Their algorithms and approach made sense.  Their solution was
"gates-in to GDS out": floorplanning, physical prototyping, timing aware
logical optimizations and placement, timing aware routing and antenna
fixing.  Our P&R flow is strictly Monterey.  We don't have any other P&R
tools.


Our Basic Design Flow
---------------------

 a) Synopsys Design Compiler
 b) Floorplanning with Monterey ICWizard
 c) Quick estimates with Monterey Sonar
 d) Detailed P&R with Monterey Dolphin
 e) Magic (enhanced) layout editor for custom/flip chip cover cell
 f) Mentor Calibre LVS/DRC/Antenna
 g) Simplex Extraction and Power Analysis
 h) Synopsys PrimeTime
 i) Synopsys Formality


Floorplanning with ICWizard
---------------------------

ICWizard was used to floorplan our two multimillion gate hierarchical
chips.  ICW's command language (Python) is cumbersome and syntax
intensive, which made the initial flow development difficult and time
consuming.  However, once the flow was scripted and driven by a
makefile, it worked easily with RTL, black boxes, size estimates, and
gate level netlists.  ICWizard is a very powerful chip planner that
allowed us to quickly change block sizes and pin placements, and to
estimate global routes.

ICWizard cannot insert power with correct design rule constraints;
therefore, power insertion required two passes.  In the first pass,
ICWizard's power insertion provides pin placement obstructions and
better global route congestion analysis.  In the second pass, the
power structure is duplicated in Dolphin, which creates the final
design rule correct power mesh.  Monterey tells us this has been
fixed/improved in current releases.

ICWizard has two algorithms for placing pins on macros.  The first
is a quick pin placement based on all top level connectivity for the
level that you are working at.  The second uses global route.  It
works well, but it takes time to correctly specify the blockages
inside mega cells when they are not all blocked to the same metal
level, which was the case for ours.

Quick P&R results with Sonar
----------------------------

Sonar is very powerful because of its quick run time and good
correlation to Dolphin.  In most cases the results of Sonar matched
Dolphin to within 5-10%.  Sonar's quick run time helps to reduce
iterations due to incorrect inputs, poor floorplanning, or bad
netlists.  The placement and optimization algorithms do an amazing
job.  We spent considerable time investigating the critical path cell
sizes and placement.  99% of the results were excellent, but a couple
of our designs encountered two problems in the placement algorithm.

The first problem is that the placer may incorrectly place flops next
to an output port when the input to the flop fails timing and the
output delay is met with many nanoseconds to spare.  We were able to
get around this by regioning the design.  The regions are used during
initial placement and are usually obeyed, unless a region creates
extra congestion or poor timing, in which case Dolphin will optimize
the placement of cells out of the requested region.

The second problem is that high fanout nets in the critical paths
are ignored if the fanout goes above a hard coded threshold.  Before
Dolphin places the design, it removes all buffering, so the critical
path may have large fanouts.  The logic before the high fanout net is
clustered, as is the logic after the high fanout net, but the clusters
are not necessarily placed near each other.  In the current release,
v2.5, this behavior is user controlled.
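
To see which nets would fall over a fanout cutoff like this before
placement, one could simply count sinks per net.  A minimal Python
sketch, assuming a hypothetical net-to-sinks mapping and a made-up
threshold of 4 (neither the data layout nor the value comes from
Dolphin):

```python
# Hypothetical data: net name -> list of sink pins it drives.
# The threshold value is an assumption for illustration, not Dolphin's.
FANOUT_THRESHOLD = 4

def high_fanout_nets(net_sinks, threshold=FANOUT_THRESHOLD):
    """Return (net, fanout) pairs whose fanout exceeds the threshold, largest first."""
    flagged = [(net, len(sinks)) for net, sinks in net_sinks.items()
               if len(sinks) > threshold]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))

nets = {
    "clk_en": ["u%d/EN" % i for i in range(12)],
    "data_0": ["u1/D", "u2/D"],
    "sel":    ["m%d/S" % i for i in range(5)],
}
flagged_nets = high_fanout_nets(nets)
for net, fanout in flagged_nets:
    print(net, fanout)
```

Nets flagged this way are the ones worth checking by hand when they
sit on a critical path.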

Detailed P&R results from Dolphin
---------------------------------

For large designs, Dolphin requires a large machine with lots of
memory: at our peak, we used 12 GB of RAM for our full chip.  Monterey
is working on a port to Linux, which should significantly decrease the
run time for smaller blocks.  Also, they claim that v2.5 has a 30%
faster run time than v2.1.  Sonar is the early quadrisectioning stage
of Dolphin, and Dolphin builds off of the Sonar results, so no time is
wasted by running Sonar.  In general, Dolphin takes more time than
other P&R tools that we've used; however, it performs more
optimizations, giving better end results.  Many hand fixes that we've
had to do with past tools were not required with Dolphin.

One of Dolphin's biggest strengths is its CTS engine, which allows the
user to balance separate trees and to utilize useful skew.  It is
difficult to control latency, but the overall skew was consistently
low.  We specified the clock insertion of every block when we put the
top level together and Dolphin did the right thing and balanced the
trees at the top level.  Where we specified things correctly, the
resulting clock trees in all the block level and chip level designs
did not require any manual fixes.
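
For readers less familiar with the distinction between latency and
skew, here is a minimal Python sketch.  It assumes the usual
definitions (latency as the worst insertion delay, skew as the spread
between the latest and earliest clock pin) and uses hypothetical
insertion delays, not numbers from our chips:

```python
def clock_tree_stats(insertion_delays):
    """Return (latency, skew): worst insertion delay and max-minus-min spread."""
    values = list(insertion_delays.values())
    latency = max(values)
    skew = latency - min(values)
    return latency, skew

# Hypothetical insertion delays in ns from clock root to each flop clock pin.
delays = {"ff_a/CK": 1.82, "ff_b/CK": 1.85, "ff_c/CK": 1.80}
latency, skew = clock_tree_stats(delays)
print("latency %.2f ns, skew %.2f ns" % (latency, skew))
```

A tree can have low skew while its latency is still larger than you
would like, which is the situation described above.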

The router is one of Dolphin's biggest strengths and biggest
weaknesses.  It does a good job, but isn't very fast.  For all of
Dolphin's parallelism, antenna avoidance and diode insertion are
performed independently at the end of the routing process, and since
the detailed route is already done by then, this step seems
excruciatingly slow.  Although the run time in v2.1 is relatively
poor, the router does produce LVS/DRC clean GDS.  Many times, we went
from Dolphin straight through Calibre with LVS/DRC clean results.
One weakness is LVS text: it is difficult to insert, and there are no
features that allow the user to rename ports or to make power ports
non-unique.  We have seen a significant speed up in the router in our
initial usage of v2.5.

PrimeTime, Simplex, & Dolphin Timing
------------------------------------

Unfortunately, our PrimeTime timing runs using SPEF files that Simplex
extracted from DEF did not always correlate to Dolphin.  They differed
by up to 5%.  With help from Simplex and Monterey, we found that in
comparison to the standard, QuickCap, Simplex was pessimistic and
Dolphin was optimistic.  Monterey found a couple of bugs that have
been fixed in v2.5, and Dolphin now correlates to QuickCap very well.
Simplex has also released a new version which isn't as pessimistic,
and we feel comfortable that this issue is now resolved.
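
A correlation check of this kind can be scripted.  The following is a
minimal Python sketch, assuming hypothetical per-path delays already
pulled out of the two timing reports (the path names, numbers, and 5%
limit are illustrative only):

```python
def percent_diff(reference, other):
    """Signed difference of `other` relative to `reference`, in percent."""
    return 100.0 * (other - reference) / reference

def miscorrelated(paths, limit_pct=5.0):
    """Return names of paths whose two delay values differ by more than limit_pct."""
    return [name for name, (ref, other) in sorted(paths.items())
            if abs(percent_diff(ref, other)) > limit_pct]

# Hypothetical path delays in ns: (sign-off timer, P&R tool estimate).
paths = {
    "pathA": (4.00, 3.90),
    "pathB": (5.00, 4.60),
    "pathC": (2.00, 2.04),
}
outliers = miscorrelated(paths)
print(outliers)
```

A negative percent_diff means the second tool is optimistic relative
to the reference, a positive one means it is pessimistic.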

Dolphin is very good at the block level, but it is missing a few
features for an optimal hierarchical flow.  The top level chip is
treated the same as block level designs, such that a .lib and a LEF
are required for all mega cells.  The LEF abstract generated by
Dolphin has to be modified due to extra pins and vias in the port
definitions, nothing that some Perl can't handle.  The .lib generation
in Dolphin is slow, so we used PrimeTime instead.  The biggest problem
at the top integration level is that different types of buffering
schemes are required due to the area consumed by large mega cells,
which leaves only small "island" areas for buffer insertion.  We had
many transition violations at the chip level that were hand fixed when
it seems they could have been fixed automatically; the fixes involved
simple upsizing or moving a buffer a millimeter to center it in a
line.
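
As an illustration of the kind of LEF cleanup mentioned above, here is
a minimal sketch.  We used Perl; this version is written in Python and
is not our actual script.  It assumes the only change needed is to
drop VIA statements inside PORT ... END sections (the extra pins are
not handled), and the sample macro is made up:

```python
def strip_port_vias(lef_text):
    """Drop VIA statements that appear inside PORT ... END sections."""
    kept = []
    in_port = False
    for line in lef_text.splitlines():
        token = line.strip()
        if token == "PORT":
            in_port = True
        elif in_port and token == "END":
            in_port = False
        elif in_port and token.startswith("VIA "):
            continue  # skip the via; everything else passes through
        kept.append(line)
    return "\n".join(kept)

# Hypothetical LEF fragment for illustration only.
sample = """MACRO blk
  PIN a
    PORT
      LAYER M3 ;
        RECT 0 0 1 1 ;
      VIA 0.5 0.5 via34 ;
    END
  END a
  OBS
    VIA 2 2 via12 ;
  END
END blk"""

cleaned = strip_port_vias(sample)
print(cleaned)
```

Vias outside of port definitions, such as the one in the OBS section
of the sample, are left alone.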

We came to realize that it was important to let Synopsys DC choose the
correct architecture/implementation for the logic, but then rely on
Dolphin to do the actual technology mapping.  Dolphin removes all
buffers and double inversions, and also does local logic synthesis
taking the placement and routing into consideration.  It makes sense
to minimize time in the synthesis tool and let Dolphin choose the
ideal logic and buffering required in order to meet timing.  Further
experimentation after tapeout found that we could reduce the area of
some blocks even further by not using any wireload models during
synthesis, just fanout rules and reduced cycle times to get the faster
architectures, and then letting Dolphin do the appropriate logic
mapping and sizing.  The savings were on the order of 10% for a one
million gate block.

Formality & The ECO Flow
------------------------

We always used Formality to verify both RTL-to-gates and gates-to-gates
at the block level.  We never found an error made by Dolphin.

The biggest problem with the ECO flow is that incremental place on a
global or detailed routed netlist takes a very long time and the
results were less than optimal.  The high level overview of our ECO
flow is:

  a) dump out a Verilog netlist
  b) edit the netlist with the ECO changes (use formal verification
     against your RTL as appropriate)
  c) run Dolphin's eco_netlist command to compare the ECO netlist to the
     current Dolphin database and recognize the changes
  d) have Dolphin dump out a DEF placement file
  e) start a fresh Dolphin database using the ECO netlist and the dumped
     DEF file (you now have a placement except for the ECO cells/changes)
  f) now run incremental place to place your new ECO cells
  g) then continue on with optimizations, global route, and detail route
     to finish up
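
To sanity check steps d) through f), it helps to know which instances
in the ECO netlist have no location in the dumped DEF, since those are
the cells incremental place has to handle.  A minimal Python sketch,
assuming the instance names have already been pulled from the netlist
and that each DEF component sits on one line; the sample data is
hypothetical and this is not what eco_netlist does internally:

```python
import re

# Matches "- <inst> <cell> ... + PLACED" or "+ FIXED" on a single line.
COMPONENT_RE = re.compile(r"^\s*-\s+(\S+)\s+\S+.*\+\s+(?:PLACED|FIXED)\b")

def placed_instances(def_text):
    """Collect instance names that carry a PLACED or FIXED location in DEF text."""
    names = set()
    in_components = False
    for line in def_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("COMPONENTS"):
            in_components = True
        elif stripped.startswith("END COMPONENTS"):
            in_components = False
        elif in_components:
            match = COMPONENT_RE.match(line)
            if match:
                names.add(match.group(1))
    return names

def eco_cells_to_place(netlist_instances, def_text):
    """Instances present in the netlist but without a placement in the DEF."""
    return sorted(set(netlist_instances) - placed_instances(def_text))

# Hypothetical DEF fragment and ECO netlist instance list.
def_text = """COMPONENTS 3 ;
- u1 NAND2X1 + PLACED ( 1000 2000 ) N ;
- u2 INVX2 + PLACED ( 1400 2000 ) FS ;
- u3 DFFX1 + UNPLACED ;
END COMPONENTS"""
eco_instances = ["u1", "u2", "u3", "eco_buf1"]

todo = eco_cells_to_place(eco_instances, def_text)
print(todo)
```

If the list is much longer than the ECO you intended, something went
wrong in the netlist edit or the DEF dump.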

We used area array I/Os, that is, I/Os sprinkled throughout the die in
optimal locations, and routed them with the upper layer of metal to
the flip chip bump locations.  Dolphin does not support this type of
I/O distribution or the routing from the I/O to the bump location, so
we manually drew a "cover cell" in Magic that had our top metal
redistribution layer (RDL) routes from the I/O locations to the bump
locations, plus additional power and ground strapping.  This cover
cell was merged with the rest of the chip at final chip assembly and
the entire merged chip ran through LVS/DRC.  Our package has built-in
power planes, and by using the RDL and flip chip we were able to drop
power and ground straight into the middle of the die as needed.
This gave us a robust power grid.

Custom blocks were drawn with Magic, .lib and LEF files created, and
treated as macro blocks in the P&R world.  The P&R needs were always
considered when implementing physical characteristics of the custom
blocks, i.e. pin placement, aspect ratio, routing obstructions, etc.

Due to the complexity of our design, i.e. the hierarchy, the large
size of the design, the use of area array I/O, we definitely filed
our fair share of bugs and enhancement requests.  We had great support
from Monterey's technical team and they would roll our required fixes
into development builds and share them with us as needed.  We realized
the risks we were taking with development builds, but we never hit
any problems.  One of the keys to our success was Monterey's support
model, which involved not only an applications engineer as our
official conduit into Monterey, but also an R&D engineer who was our
champion in the R&D world.  He would watch out for us in R&D meetings
and whenever we hit something really nasty.  Of course, this also benefits
Monterey by giving their R&D team immediate access to customers.
Most of our routine bugs were fixed in v2.1, with which we taped out.
Our designs and hierarchical needs uncovered some methodology issues
that are being addressed in future releases: issues relating to
flip-chip support and better top level buffering.


Overall
-------

Monterey provided us with a set of tools, instructions, and good support.
We ran the tools ourselves, and while we needed bug and enhancement fixes,
we didn't rely on any taxicab mode operations -- all P&R was done by Tau
employees in house.  TCL and Perl skills are definitely required.

We were able to tapeout two complex chips in relatively short order with a
small team on a deterministic schedule -- I don't think we could have
achieved as much as we did with any other physical synthesis tool.

Both chips are back and working.

    - Tim Lantz
      Physical Design Lead
      Tau Networks                               Scotts Valley, CA


============================================================================
 Trying to figure out a Synopsys bug?  Want to hear how 14,488 other users
  dealt with it?  Then join the E-Mail Synopsys Users Group (ESNUG)!
 
     !!!     "It's not a BUG,               jcooley@TheWorld.com
    /o o\  /  it's a FEATURE!"                 (508) 429-4357
   (  >  )
    \ - /     - John Cooley, EDA & ASIC Design Consultant in Synopsys,
    _] [_         Verilog, VHDL and numerous Design Methodologies.

    Holliston Poor Farm, P.O. Box 6222, Holliston, MA  01746-6222
  Legal Disclaimer: "As always, anything said here is only opinion."
 The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com