( ESNUG 345 Item 4 ) ---------------------------------------------- [3/1/00]

Subject: ( ESNUG 344 #5 )  PhysOpt: 11 Runs; Timing-Driven Qplace: 2nd Base

> Doing PhysOpt synthesis and placement takes a lot of CPU time.  Most of
> our blocks ran in 2-24 hours, but we had one large nasty block which took
> over 80 hours to complete because we didn't take a true hierarchical
> approach with it.  (This block was over 1/4 of the entire chip in our
> multi-million gate design.)  ...  Ouch.  What we saw with this flat
> routing approach was that the global routes over modules sometimes
> interfered with the local routes of those modules, throwing the
> estimates that PhysOpt made on those nets off (sometimes by more than
> 1mm!).  Ouch.
>
>     - David Romanauskas, Design Engineer
>       Matrox                                   Montreal, Canada


From: Dave Balhiser <balhiser@fc.hp.com>

Hi, John,

At my division in Agilent, we're fanatics about using a pure hierarchical
methodology for everything (placing, routing, and static timing).  This way
we can easily turn parts of the design independently and quickly.  In our
standard cell designs, subdividing the place and route problem also helps
to improve the correlation between wireload model (WLM) based timing and
extracted timing.

For us, a "block" is the level of the hierarchy at which standard-cell place
and route is performed.  A "block" can also refer to a piece of IP such as
a RAM, ROM, or other hard macro.  Our design flow supports n levels of
hierarchy, allowing several blocks to be routed together to form an "n+1"
block, and these meta-blocks are routed together to form a chip.

Because our approach is block-based we can mix and match our P&R methods 
on a per-block basis within a chip.  We had been using Silicon Ensemble
5.2QSR3 for place and route.  In general, ~75% of our blocks benefited
from using SE's timing-driven features, though some blocks did much better
with SE's timing-driven features turned off.

Nonetheless, we've found that WLM timing and extracted timing for
challenging blocks can differ by up to 40% of the clock period.  Because of
this, we looked into PhysOpt, too.


Our Plain Vanilla SE Flow
-------------------------

  1. Start with a flat Verilog netlist for a block.  This netlist is based
     on synthesis to a custom WLM.

  2. Create a block floorplan for the block based on the chip floorplan.
     (We used some in-house proprietary tools plus PrimeTime budgeting
     and some hacks to do this.)

  3. Place the block with Cadence timing-driven Qplace.
   
  4. Connect the scan chain(s) with Silicon Ensemble's troute.

  5. Route with timing-driven Warproute.  (But don't always use the 
     timing-driven Warproute -- to our surprise, we've discovered that some
     non-timing-driven Warproute runs gave better results!)

  6. Extract timing with our in-house RC extract tool and create DSPF.

  7. Perform static timing with PrimeTime.  (PrimeTime had difficulties
     handling detailed parasitic info between blocks.  You can't budget
     between hierarchical pins in PrimeTime and we had to hack a
     complicated workaround to get it to work.)

  8. If needed, fix timing violations with any number of "tricks".  (The
     mildest of these tricks is re-routing; often they mean going all the
     way back to Step 2.)


Enter PhysOpt
-------------

We're currently using PhysOpt on a 3.7 million gate, predominantly standard
cell, 3.3 nsec design with many latency sensitive paths.  Many of our paths
come in through a chip pin, through the pad, through about 20 levels of
logic, to a register, through another 20-odd levels of logic and out of a
chip pin on the other side of the die.  These single-cycle latency paths
can traverse several blocks, making the timing inside each block quite
challenging.  It is not uncommon for a 10-level logic path in a block to be
budgeted only 900 ps to enter and exit the block.
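
To put that budget in perspective, here is a small worked calculation using
only the figures quoted above (900 ps, 10 logic levels, 3.3 nsec clock).  The
function and constant names are illustrative and not part of any tool.

```python
# Figures taken from the letter: a 10-level path budgeted 900 ps inside one
# block, on a design with a 3.3 ns clock period.
BLOCK_BUDGET_PS = 900
LOGIC_LEVELS = 10
CLOCK_PERIOD_PS = 3300


def per_level_budget_ps(budget_ps, levels):
    """Average delay allowed per logic level (cell plus wire delay)."""
    return budget_ps / levels


def budget_share_of_clock(budget_ps, period_ps):
    """Fraction of the clock period that the block budget represents."""
    return budget_ps / period_ps


per_level = per_level_budget_ps(BLOCK_BUDGET_PS, LOGIC_LEVELS)
share = budget_share_of_clock(BLOCK_BUDGET_PS, CLOCK_PERIOD_PS)
print(f"{per_level:.0f} ps per level, {share:.0%} of the clock period")
```

So each level gets about 90 ps on average, wires included, and the whole
pass through the block gets roughly a quarter of the cycle.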

Our goal with our Plain Vanilla SE Flow was to avoid resorting to the
iterations and the tricks mentioned in Step 8.  For 10 out of our 30 blocks
this was possible.  For 12 more blocks, we were able to meet block timing
with numerous iterations and some labor-intensive tricks.  Not good.
These 12 blocks initially missed spec by 200-700 ps.  That left 8 blocks that
kept our designers busy using time-consuming tricks to meet spec: pre-routes,
pre-placements, and even hand-routes of critical signals.  These are the
blocks that without manual intervention missed the target clock period by 
about 700-1200 ps.
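
The breakdown above can be tallied as follows.  This is only a sketch of
the arithmetic: the group descriptions are paraphrases, and the misses are
expressed against the 3.3 nsec clock period mentioned earlier.

```python
CLOCK_PERIOD_PS = 3300  # 3.3 ns, from the letter

# (group description, block count, initial miss range in ps, or None)
GROUPS = [
    ("met timing without tricks", 10, None),
    ("met timing after iterations", 12, (200, 700)),
    ("needed heavy manual work", 8, (700, 1200)),
]


def miss_as_fraction(miss_ps, period_ps=CLOCK_PERIOD_PS):
    """Express a timing miss as a fraction of the clock period."""
    return miss_ps / period_ps


total_blocks = sum(count for _, count, _ in GROUPS)

for name, count, miss in GROUPS:
    share = count / total_blocks
    if miss is None:
        print(f"{name}: {count} blocks ({share:.0%})")
    else:
        low, high = (miss_as_fraction(m) for m in miss)
        print(f"{name}: {count} blocks ({share:.0%}), "
              f"missed by {low:.0%}-{high:.0%} of the clock period")
```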

Our hope was that PhysOpt was an automated approach to closing those timing
loops.  We've been using it on a block-by-block basis on our design.  Our
new flow remained largely unchanged with the exception of Step 3, where we
replaced the timing-driven SE Qplace with PhysOpt.  (We also added a formal
verification step to make sure PhysOpt kept the original functionality just
to be safe.)

My only minor gripes with PhysOpt are the data translations we had to do.
PhysOpt needs both a db and a pdb file.  Db's come from Design Compiler.
To make a pdb, you need to craft a LEF library of all the basic cells used
in your design (AND, NAND, NOR, FFs, etc.).  Then you run lef2plib.  Then
we had to hand-edit the plib file because of a bus naming style problem.
Then in psyn_shell we did a "read_lib" followed by a "write_lib -o" to 
finally get the pdb we need.  This should be an automated process, not
something we have to do by hand.
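
The hand-edit step is the kind of thing a small script can take over.  The
sketch below is purely illustrative: the letter does not say what the bus
naming mismatch actually was, so the angle-bracket to square-bracket
rewrite, the sample text, and the function name are all assumptions.  It
treats the plib as plain text and does not call lef2plib or psyn_shell.

```python
import re

# Matches an identifier followed by one index in angle brackets, such as
# data<3>.  The angle-bracket style is an assumption for illustration only.
_BUS_BIT = re.compile(r"\b([A-Za-z_]\w*)<(\d+)>")


def normalize_bus_names(text):
    """Rewrite name<N> as name[N], leaving all other text untouched."""
    return _BUS_BIT.sub(r"\1[\2]", text)


# Made-up sample text, not real plib content.
sample = "pin(data<3>) pin(data<2>) pin(clk)"
print(normalize_bus_names(sample))
```

A filter like this could sit between the lef2plib run and the read_lib
step, so the translation no longer needs an editor session.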

For discussion purposes, I'm going to refer to three general categories of
blocks as "easy", "medium", and "hard".  "Easy" means slam dunk with
traditional SE methods.  "Medium" means that a few iterations are needed.
"Hard" means a lot of hard work and manual intervention is needed.

All of our blocks were 30K to 75K gates, with our typical PhysOpt runs
being: 30 minutes for an "easy" block; 2-3 hours for a "medium" block; and
5-6 hours for a "hard" block.
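
For a rough feel of the total CPU time, the typical run times above can be
multiplied out over a set of blocks.  The sketch assumes one typical run per
block, run one after another; the 4/6/2 mix is the set of 12 blocks reported
later in this letter.

```python
# Typical PhysOpt run time per block, in hours, as (low, high) ranges.
RUN_HOURS = {
    "easy": (0.5, 0.5),
    "medium": (2.0, 3.0),
    "hard": (5.0, 6.0),
}


def total_cpu_hours(block_counts):
    """Return (low, high) total hours for a dict of category -> count."""
    low = sum(RUN_HOURS[cat][0] * n for cat, n in block_counts.items())
    high = sum(RUN_HOURS[cat][1] * n for cat, n in block_counts.items())
    return low, high


low, high = total_cpu_hours({"easy": 4, "medium": 6, "hard": 2})
print(f"about {low:.0f} to {high:.0f} CPU hours for one pass")
```

Re-runs after script tuning are not counted, so real totals will be higher.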

For "easy" blocks the PhysOpt command was simply:

  physopt

For the more difficult blocks we had to tune our script.  This "medium"
script example runs about 3X longer, but worked well:

  physopt -effort medium -area_recovery
  run_router -effort high
  physopt -incr -effort high -congestion -congestion_effort high \
          -area_recovery
  physopt -incr -effort medium -area_recovery


What We Found
-------------

So far, we've only run PhysOpt on 12 blocks.  There were 4 "easy", 6
"medium", and 2 "hard" blocks in these 12.  With PhysOpt we have been able
to make 11 of these 12 meet timing.  Only one of the "hard" blocks gave
PhysOpt significant trouble.  (Depending on how we tweaked our script,
PhysOpt placement either created an optimized netlist that was too big to
fit in the block (floorplan) -- or -- the final PhysOpt placement was
unrouteable in SE.)

On this problem child block we had to switch back to Cadence's Qplace
(in timing-driven mode and without PBopt).  Reverting back to this
labor-intensive mode, we got our problem child block within 120 ps
(extracted timing) of spec.  Closer, but still not there.

For the remaining 11 blocks, our PhysOpt results were very positive.
The "easy" blocks remain "easy"; the "medium" blocks became "easy" (they go
right through with little or no manual intervention), and the other "hard"
block became a "medium" -- we only had to play a few tricks to meet timing.

The important thing to note here, John, was that those 11 blocks were
handled automatically by PhysOpt.  For us, hassle-free EDA == Good News.
In baseball terms, what we're seeing is 11 home runs by PhysOpt and 1 run
to 2nd base by timing-driven Qplace.

We have noticed that with the 1.0 HP build of PhysOpt, the run_router
command should not be run before timing reports -- it makes the timing
reports far too pessimistic.  (The run_router command does global and/or
Steiner route estimates depending on which level you use it on.)  You must
run "physopt -incr" after every "run_router" command or you'll have some
very messed up timing reports.

We have also noticed that the first "physopt" command should not be run
with the "-congestion" flag because it puts registers in really bad initial
locations that subsequent incremental runs can't improve.  (This was our
first STAR for PhysOpt.)
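
Both rules of thumb are easy to check mechanically before a long run.  The
following is a minimal sketch of such a check, not a Synopsys utility: it
only looks at command names and flags that appear in this letter, and the
function names are made up for illustration.

```python
def split_commands(script_text):
    """Split a script into token lists, joining backslash continuations."""
    commands, current = [], ""
    for raw in script_text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            current += line[:-1] + " "
            continue
        current += line
        if current.strip():
            commands.append(current.split())
        current = ""
    if current.strip():
        commands.append(current.split())
    return commands


def check_script(script_text):
    """Return a list of rule violations (empty if the script looks fine)."""
    problems = []
    seen_physopt = False
    router_pending = False  # a run_router still waiting for physopt -incr
    for tokens in split_commands(script_text):
        name, flags = tokens[0], tokens[1:]
        if name == "physopt":
            if not seen_physopt and "-congestion" in flags:
                problems.append("first physopt uses -congestion")
            seen_physopt = True
            if "-incr" in flags:
                router_pending = False
        elif name == "run_router":
            if router_pending:
                problems.append("run_router not followed by physopt -incr")
            router_pending = True
    if router_pending:
        problems.append("run_router not followed by physopt -incr")
    return problems


MEDIUM_SCRIPT = """\
physopt -effort medium -area_recovery
run_router -effort high
physopt -incr -effort high -congestion -congestion_effort high \\
        -area_recovery
physopt -incr -effort medium -area_recovery
"""

print(check_script(MEDIUM_SCRIPT))
```

The "medium" script shown earlier passes both checks, while a script that
ends on run_router, or opens with "physopt -congestion", gets flagged.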

My last gripe about PhysOpt is the hassle involved with trying to pre-place
physical cells not in the logical netlist.  (These cells have no Verilog 
function but we still need them.  Some examples: clock buffers, scan logic,
and spare cells.)  We had to trick PhysOpt by creating layer 0 obstructions
in a Tcl directives file for PhysOpt for these pre-placed cells.  Then we let
PhysOpt place everything else around the pre-placed cells.  PhysOpt outputs
a db.  Then we ran db2def, followed by our set of homebrew Perl scripts, to
finally sew all the DEF files back together.  This should be automated, just
as the prior LEF/plib/pdb translation should be.

We have not had a chance to take blocks with embedded hard macros completely
through the PhysOpt flow.  This is next on our agenda, and it will be
interesting to see what it does with them.

These caveats notwithstanding, PhysOpt has made us more productive by
keeping our "easy" blocks "easy", making our "medium" blocks "easy", and
turning that one "hard" block into a "medium".  It has been quite stable for beta
code, and is reasonably easy to learn (~1 week) for someone familiar with
Design Compiler and backend tools.  

We expect to tape-out in 2-3 months.

    - Dave Balhiser
      Agilent Technologies                         Ft. Collins, CO



