

( ESNUG 358 Item 3 ) --------------------------------------------- [8/23/00]

Subject: ( DAC 00 #39 ) The Unisys 1.5 Mgate IBM ASIC PhysOpt Tape-Out

> Click on http://www.deepchip.com/items/snug00-18.html and you'll find the
> April 2000 score board of known physical synthesis tape-outs for everyone:
> Magma, Monterey, Cadence PKS, PhysOpt.  The data for everyone there, other
> than PhysOpt, still holds true for July 2000.  That is, no customer has,
> to date, used Magma, Monterey, or PKS to make a real chip.  In the
> PhysOpt table, add another tape-out in April by nVidia (the MV12 chip),
> plus a new PhysOpt user tape-out by Unisys for July.  The Unisys chip is
> a 1.5 million gate, 0.18 um, IBM fab ASIC with three clocks at 100, 133,
> and 200 MHz.  Ken Merryman of Unisys is the designer.  This means that
> PhysOpt now has a total of 10 confirmed customer tape-outs as of July, 2000.


From: Ken Merryman <kenneth.merryman@unisys.com>

Hi, John,

I saw that you referenced my tape-out in the PhysOpt part of your DAC Trip
Report so I thought I'd share with you the details of that tape-out.

Before using PhysOpt, our standard flow for IBM chips at or below 0.5 um
was to rely heavily on floor-planning plus lots of manual placement.  On
a single design we have one main backend guy who focuses on manual
floorplanning, manual placement, re-buffering, manual drive strength
adjustment, and re-timing, all in parallel with the logical
verification.  We also have three other guys who (respectively) do clocktree
synthesis, IBM routing, and Einstimer runs.  The last guy does synthesis
(that's me.)

This process was very ECO friendly because the team would retain the same
net names throughout every stage and revision of the design.  (We worked
very incrementally, holding registers and I/O constant so we wouldn't have
to rework our clock trees and the placement.)  It could easily take 2 to 8
weeks to finish the physical work on a design.  After we were done, IBM
would take our design, run their DRCs on it, and fab it.

This methodology generally worked very well for us with 0.5 and 0.25 micron
IBM ASICs, but started choking at 0.18 and 0.12 microns.  The usual wire-
rule-based optimization solutions (which didn't account for the higher
percentage of foil delay) were just not adequate starting points in many
cases.  Our standard floorplan/re-buffer/re-size flow couldn't close these
0.18 designs.  We ran into trouble with the largest ASIC in our ES7000
Enterprise Server design.  It was a 0.18 um, 1.5 million gate, 325,000 cell
instance, 5 clock, 100/200 MHz mixed latch and flip-flop IBM ASIC.  The
physical design team had been trying to solve the last thousand timing
problems for two months while the other ASICs in the chip set were finishing
logical verification.  Worst negative slack was still 1.2 nsec (on a 5 nsec
path) and it didn't look like timing closure was going to happen any time
soon.

To summarize our 0.5 to 0.25 um flow:

    - basic flow
        1.) DC
        2.) Internal "Name Retain" Tool
        3.) IBM Booledozer
        4.) Internal FP
        5.) IBM ChipBench
        6.) Clocktree
        7.) Einstimer
        8.) HDP/Proprietary FP used for placement
        9.) RC extraction with Chip Edit

    - internal design budgeting solution
    - Highly Incremental
    - Stacks and hierarchical floor-plan of blocks 10-40K instances
      (avg. ~250 per design)
    - Flat placement
    - Synopsys verified that this approach was realistic and consistent
      with their recommended CWLM approach.

    - NO timing driven layout
    - Flow caused many iterations between P&R and synthesis
    - 10 - 15 iterations because pre- vs. post-layout timing varied wildly
      (mostly we don't go back to DC; once we're in the backend, we
      tend to stay in the backend.)
    - Managed to hit target performance at cost of delayed schedule
    - Required many custom tools and much hand tweaking to achieve closure


At that point Synopsys agreed to help us run it through their PhysOpt.  It
took us about a month to get a gates-to-gates PhysOpt process up and running
although the majority of our early work was dealing with the lack of IBM
libs and getting file format differences worked out.  Since this was a very
new tool, we had to write our own tool to translate IBM's SA12 lib (in IBM
VIM format) to LEF (which is PhysOpt useable.)  We used PhysOpt's lef2plib
conversion utility to get the SA12 data into PhysOpt.

I've heard that IBM now gives out SA12 LEF or Synopsys .pdb libs for PhysOpt
(I'm not sure which).

We also had to work out the PhysOpt SA12 fudge factor for IBM's Einstimer.
You have to tweak PhysOpt R & C values to agree with Einstimer because it's
trying to correlate two different timers (PhysOpt's and Einstimer's.)  We
had to find the SA12 metal layer "multipliers" ("multipliers" are an
erroneous Synopsys name -- they're actually replacement values for extracted
R's and C's for PhysOpt sorted by metal layer.)  PhysOpt has about a dozen
"*_multiplier" variables/commands for the vertical and horizontal R and C
values.  To be sure that PhysOpt is using your overrides for R's & C's in
each metal layer, look for (GR-4) and (GR-5) messages in your PhysOpt log:

    Information: Using user specified R and C coefficients.  (GR-5)
    Information: Cap -  horizontal: 3.7e-07   vertical: 3.9e-07 (GR-4)
    Information: Res -  horizontal: 0.0006   vertical: 0.0005 (GR-4)

Otherwise it crafts its own localized RC values.  After this calibration,
our PhysOpt timing correlated within 2% of IBM SA12 Einstimer timing.
Synopsys has now automated this calibration process, but when we used
PhysOpt, we had to do it all ourselves. 
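
The log check described above is easy to script.  Here is a minimal Python
sketch that scans a PhysOpt log for the (GR-5) and (GR-4) messages and pulls
out the R and C values.  The regular expressions are built only from the
three log lines quoted above; the function name and sample text are
hypothetical, and other PhysOpt versions may word these messages differently.

```python
import re

# Patterns follow the three log lines quoted in the post. Treat the exact
# wording as an assumption: other tool versions may print it differently.
GR5 = re.compile(r"Using user specified R and C coefficients\.\s*\(GR-5\)")
GR4 = re.compile(
    r"Information:\s*(Cap|Res)\s*-\s*horizontal:\s*(\S+)\s+"
    r"vertical:\s*(\S+)\s*\(GR-4\)"
)


def check_rc_overrides(log_text):
    """Return (overrides_seen, values) parsed from PhysOpt log text.

    overrides_seen is True if a (GR-5) message appears.
    values maps "Cap" / "Res" to a (horizontal, vertical) tuple of floats.
    """
    overrides_seen = False
    values = {}
    for line in log_text.splitlines():
        if GR5.search(line):
            overrides_seen = True
        match = GR4.search(line)
        if match:
            kind, horizontal, vertical = match.groups()
            values[kind] = (float(horizontal), float(vertical))
    return overrides_seen, values


# Sample text copied from the log excerpt in the post.
SAMPLE_LOG = """\
Information: Using user specified R and C coefficients.  (GR-5)
Information: Cap -  horizontal: 3.7e-07   vertical: 3.9e-07 (GR-4)
Information: Res -  horizontal: 0.0006   vertical: 0.0005 (GR-4)
"""

if __name__ == "__main__":
    seen, values = check_rc_overrides(SAMPLE_LOG)
    print("user R/C overrides in use:", seen)
    for kind, (horizontal, vertical) in values.items():
        print(kind, "horizontal:", horizontal, "vertical:", vertical)
```

If the (GR-5) line is missing, the function returns False and the run is
presumably using the tool's own localized RC values.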


Also included in our month long PhysOpt ramp-up was some training time
since two of us working on this chip had little experience w/ the physical
design process and the file formats used to exchange data.  


We primarily used PhysOpt in single pass "-post_route" mode.  We mostly
experimented with different sets of fixed placement since, out of
cautiousness, we did not want to destroy the physical work already done to
the design.  Basically we started with a fully placed *flat* design that
our physical design engineer had been struggling with.  

Our PhysOpt optimization script was basically:

  1. setup libraries, paths etc
  2. read_edif
  3. read_pdef
  4. source the back annotated capacitance file
  5. read_sdf
  6. report_design
  7. source clock script
  8. source the top level constraint file
  9. physopt -check_only
 10. set the capacitance and resistance multipliers
 11. report_timing for initial state of the design
 12. physopt -post_route -incremental
 13. report_timing for results
 14. write out the results - pdef, db, EDIF

The PhysOpt compiles ran from 7 to 24 hours and brought our Worst Negative
Slack down to around 200 pico-seconds.   We still had a few problems
requiring manual attention mostly caused by our fixed placement directives.
Our lead physical design guy said that the problems left over by PhysOpt
were difficult or impossible to fix.  (i.e. His way of saying "PhysOpt did
a really good job".)

To summarize our 0.18 to 0.12 um flow:

  - HDP/PhysOpt/HDP
  - Top level floor-plan stayed the same
  - No regioning required
  - Inputs/outputs to PhysOpt: plib, PDEF3, db, SDF, EDIF, and cap files
  - Converted VIM/PDEF3/EDIF from HDP to plib/PDEF and back for IBM HDP
  - Converter work only required a few simple tweaks

Improvements in our PhysOpt constraints and fixed placement filters let us
tape-out the chip on schedule.


Gotchas
-------

The most important gotcha we ran into with PhysOpt was to NOT do the old
Design Compiler habit of margining and/or slightly over-constraining our
design.  In fact, practically all of our PhysOpt issues had something to do
with mis-constraining our design.  PhysOpt will do a number of crazy things
if you lie to it or try to get it to do the impossible by overconstraining.

For example, we got burned by IBM's "I/O affinity cells" in PhysOpt.  These
IBM cells need to be near the chip's I/O cells.  Since we didn't constrain the
I/O properly, PhysOpt very naturally moved the "affinity cells" away from
the I/O to make room for other logic.  Einstimer complained horribly.  Once
we eventually figured out the problem and accurately constrained all our
I/O's in PhysOpt, it then worked like a charm.   Most of our flow is hacked
together with many of our own tools mixed with IBM tools.  My PhysOpt
interactions were pretty simple; I only used a 50 line script to do all of
its work.  I've heard that in the new flow in the IBM design kits, the
required floorplan is generated automatically via some scripts and it takes
into consideration the I/O placement and affinity info, fat wire creation,
filler cells, RLM placement, hole-punching for I/O cells, etc.  If this is
true, it should simplify the data transfer from HDP to PhysOpt.
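
A quick sanity check after a PhysOpt run could have caught the affinity
cell problem before Einstimer did.  The Python sketch below is only an
illustration of the idea: it flags affinity cells that ended up farther
than some limit from their I/O cell.  The cell names, coordinates, distance
limit, and the assumption that placement has already been pulled out of the
PDEF into a simple name-to-(x, y) table are all hypothetical.

```python
def far_affinity_cells(affinity_to_io, locations, max_distance):
    """List affinity cells placed too far from their I/O cell.

    affinity_to_io: dict mapping affinity cell name -> I/O cell name
    locations:      dict mapping cell name -> (x, y) placement
    max_distance:   allowed Manhattan distance, same units as locations

    Returns (affinity_cell, io_cell, distance) tuples, worst first.
    """
    offenders = []
    for cell, io_cell in affinity_to_io.items():
        cell_x, cell_y = locations[cell]
        io_x, io_y = locations[io_cell]
        # Manhattan distance, since routes run horizontally and vertically.
        distance = abs(cell_x - io_x) + abs(cell_y - io_y)
        if distance > max_distance:
            offenders.append((cell, io_cell, distance))
    return sorted(offenders, key=lambda item: item[2], reverse=True)


# Made-up example data, for illustration only.
EXAMPLE_LOCATIONS = {
    "io_a": (0, 0),
    "aff_a": (10, 5),
    "io_b": (0, 100),
    "aff_b": (400, 300),
}
EXAMPLE_AFFINITY = {"aff_a": "io_a", "aff_b": "io_b"}

if __name__ == "__main__":
    offenders = far_affinity_cells(EXAMPLE_AFFINITY, EXAMPLE_LOCATIONS, 50)
    for cell, io_cell, distance in offenders:
        print(cell, "is", distance, "units from", io_cell)
```

The real fix is still to constrain the I/O accurately; a check like this
only tells you that something moved.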

Another PhysOpt constraint-sensitivity example was that we didn't have all
our design's timing exceptions set.  This caused PhysOpt to take off into
lala land with very, very long runtimes.  The tool was basically going nuts
trying to fix timing on things that didn't need fixing.  Once we got our
timing exceptions in, PhysOpt settled down to realistic (>24 hour) runtimes
working on stuff that actually needed work.

We also found that when running the entire ASIC flat in PhysOpt, the run
time seems to be more dependent on the number of timing problems it has to
fix, rather than the size of the design.  (Synopsys says for 32-bit
machines, the upper limit for memory usage is 3.8 gbytes; our design was
2.2 to 2.4 gbyte.)  Using our standard physical process to get close to the
timing goals then letting PhysOpt finish the job worked well for us.
Our plan, though, is to decrease our need to do so much manual placement
and increase the work that PhysOpt does.

We also ran a 0.12 micron, 133 MHz, 9 clock design with a 266 MHz
interface, 170,000 placeable cells and lots of RAM.  We tried running this
ASIC flat doing very minimal initial placement like RAMs and I/O's and
found the run times to be excessive (on the order of over 7 days.)  That
design started with TNS of 251,000 nsec and 32,546 violating paths and WNS
of 34 ns.  After establishing a good starting point with good constraints,
some floor-planning and register placement, the design starts with WNS 6 ns,
TNS 7,200, and 10,000 violating paths.  It runs in PhysOpt in a day and a
half to two days and finishes with a few pico-seconds of negative slack,
which is a better result than our manual process gave us.
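
For readers less familiar with the three numbers quoted above, here is a
minimal Python sketch of how WNS, TNS, and the violating path count fall
out of a list of slacks.  It assumes one slack value per timing path in
nsec, with negative meaning a violation; the function name and sample
values are made up.  Note that the figures above are quoted as positive
magnitudes, while this sketch keeps the negative sign.

```python
def summarize_slack(slacks_ns):
    """Summarize a list of path slacks (nsec, negative means violating).

    Returns a dict with:
      wns             - worst negative slack (0.0 if nothing violates)
      tns             - total negative slack, the sum of all negative slacks
      violating_paths - number of paths with negative slack
    """
    violating = [slack for slack in slacks_ns if slack < 0.0]
    wns = min(violating) if violating else 0.0
    tns = sum(violating)
    return {"wns": wns, "tns": tns, "violating_paths": len(violating)}


if __name__ == "__main__":
    # Made-up slacks for four paths, for illustration only.
    summary = summarize_slack([-1.5, -0.5, 0.25, 2.0])
    print("WNS:", summary["wns"], "nsec")
    print("TNS:", summary["tns"], "nsec")
    print("violating paths:", summary["violating_paths"])
```

Tracking all three together matters: a small WNS with a huge TNS means
lots of paths are a little bit broken, which is a different problem from
one very bad path.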

Good, *accurate* constraints and a little bit of floorplanning make a *big*
difference in successful PhysOpt runs.


One final gotcha that became apparent after using PhysOpt was how it affected
our ability to do a metal-only ECO on our design.  Doing the entire ASIC
flat in PhysOpt in "-post_route -incremental" mode destroys a lot of
hierarchical names.  This makes it very difficult to correlate the gates to
source code when doing a manual metal-only ECO -- since once we flatten the
ASIC, we're then forced to operate on the *entire* ASIC.  I know
Synopsys is planning to phase out ECO Compiler, but we still need a solution
to this problem.  We typically back-fill unused areas on our chips with
gate array cells so we can do last minute metal-only changes.  We need a
PhysOpt "ECO mode" that leaves current cells placed and uses only available
back-filled gate array cells to make the changes.  Basically, John, we need
ECO Compiler or its capability added to PhysOpt.


We are currently working on developing our RTL-to-gates process which
includes budgeting and "divide and conquer" hierarchical process for our
larger ASICs.

    - Ken Merryman
      Unisys                                     Minneapolis, MN
