( ESNUG 364 Item 14 ) -------------------------------------------- [02/01/01]
Subject: ( ESNUG 363 #11 ) Cadence PBOPT, Synopsys LBO, & FlexRoute Tricks
> As we get a good idea of the block sizes and shapes we floorplan the top
> level, with around 15 blocks & the I/O cells. We ship the top level off
> to the third party & they insert clocks at the top level. So far so good.
>
> Now they try to fix the timing on the long nets between blocks using
> QPOPT, and the results are horrendous. At this point we would've already
> generated timing models for each of the blocks, and they are inputs to
> the QPOPT runs. In many different attempts at top level timing
> optimization, QPOPT has not been able to put in an appropriate number of
> buffers/repeaters to achieve reasonable timing. I did some experiments
> with long nets and various numbers of buffers and found that I should be
> able to go 5 mm in about 1.2 nS even with a less than optimum repeater
> scheme. QPOPT isn't even getting close. When we talked to Cadence R&D
> about this they basically said that QPOPT isn't intended to do this type
> of optimization.
>
> So my question is, how are other people doing the timing optimization
> (buffer and repeater insertion) at the top level for a hierarchical
> physical flow?
>
> - Chris Simon
> General Dynamics Information Systems Minneapolis, MN
From: [ Intel Inside ]
John,
As usual, please keep me anonymous.
One challenge of any design flow is to build a methodology that works around
any weaknesses that your tools may have. This is usually not enough. You
need to do other things to help make your tools/methodologies have less work
to do. For example, it's standard practice to either flop all signal inputs
or outputs at partition boundaries to get the best synthesis results and to
help top level timing issues. I don't know if you did this for your design.
You can take this a step further and have all partition inputs and outputs
flopped. Having no combinational logic between flop boundaries at partition
edges may have some implications on your design, but will surely give your
signals more time to traverse the top level real estate. You can even go
further and duplicate your output flops to ensure that all signal outputs
have a fanout of one input. This really helps solve top level timing
issues. Naturally, these have implications for your RTL, but it's food for
thought. If you can make the tools have an easier problem to solve, less
hand effort will be required.
- [ Intel Inside ]
---- ---- ---- ---- ---- ---- ----
From: "Lee Keep" <Lee_Keep@eur.3com.com>
Hi John,
I read Chris' article on ESNUG with interest as I have been working in the
area of hierarchical physical design timing closure for a few years now.
I have a few questions for my own clarification and some suggestions that
may help. You may well have tried some of these already - but here goes
anyway....
1) Timing budgets for inter-block paths
You say that your block level timing is pretty much OK - but how did you
allocate timing budgets for paths that cross between blocks during the
synthesis phase? Did the timing budget include any allowance for top
level interconnects based on your 1.2 ns per 5 mm observation?
2) Top level routing
Did you attempt any form of top level routing prior to closing the
timing of the sub-blocks? Maybe a methodology where you complete
the routing of your top level floorplan (black box sub-blocks), followed
by a parasitic extraction and propagation of the values to your
sub-blocks could help. Even a top level global route may provide a
better starting point for your sub-block constraints during synthesis.
That way, you pass some of the top-level effort down into the
sub-blocks, which you know are easier to close.
3) Sub-block I/O buffering
I personally recommend a strategy where you insert buffers, connected to
all signal I/O ports within each of your sub-blocks. We use a dc_shell
script to do this post synthesis. These buffers should be given
priority during sub-block placement to ensure they are placed in a cell
row as close to the I/O port as possible. We use Avanti P&R and their
TDF constraint format that allows these weightings to be applied. By
choosing a sensible naming convention for these buffers you can also
highlight them post-placement to ensure they've gone in the correct
location. It's been a while since I used SEDSM but I seem to remember
something similar can be achieved. Ensure you 'dont touch' these buffers
during any subsequent optimisation passes as QPOPT may well try to
remove them at the sub-block level.
This approach has helped us minimize the number of repeaters required in
the top level layout - but some of our longest top level nets still
needed some manual work.
4) Repeater insertion / ECO placement
You indicate that QPOPT can't insert enough buffers to do the job. How
densely placed is your logic? I've seen placement utilisations so high
that they prevent the optimiser from inserting the number of buffers it
wants. However, I expect it's more to do with the tool running out of
steam. I also remember hearing a Cadence get-out clause that these
tools could only provide incremental timing improvements of 5-10% -- and
those were the days of PBOpt. Seems this benchmark probably holds true
today.
How happy are you that your repeaters are being placed in a sensible
location by QPOPT? We use Synopsys LBO to fix our timing-broken netlist
(as opposed to a layout-engine based optimiser such as QPOPT or Saturn).
We found that LBO was able to add a sufficient number of repeaters, but
when it came to the ECO placement -- they were going in the wrong place
-- sometimes causing the timing of a particular path to get even worse.
The suggested location in our PDEF was not honoured by the ECO placer,
requiring some manual placement work.
5) Sub-block pin optimisation
When you floorplan the top level, are you doing the pin optimisation of
sub-block interfaces or is the third party? Just wondering if sub-block
port locations are as optimal as possible in your floorplan? I guess
with 15 blocks you are constrained in many directions when it comes to
this so finding the optimal solution is tough.
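To make the budgeting question in item 1 concrete, here is a minimal Python
sketch of how a top-level wire allowance could be taken out of the clock
period before the remainder is split between the driving and receiving
blocks. The 1.2 ns per 5 mm figure is Chris's own estimate; the function
names, the linear scaling with length, the 50/50 split and the 5 ns clock
are all assumptions made up for illustration, not anyone's actual flow.

```python
# Hypothetical sketch: budget an inter-block path from a wire-delay allowance.
# Only the 1.2 ns per 5 mm figure comes from the original post; every name
# and example number below is an assumption.

NS_PER_MM = 1.2 / 5.0  # repeated-wire delay per mm, assumed to scale linearly


def top_level_wire_delay_ns(length_mm):
    """Estimated delay of a repeated top-level wire of the given length."""
    return length_mm * NS_PER_MM


def block_budgets_ns(clock_period_ns, wire_length_mm, driver_share=0.5):
    """Return (driver-block budget, receiver-block budget) in ns.

    The wire allowance is removed first; what is left is split between the
    driving and receiving blocks according to driver_share.
    """
    remaining = clock_period_ns - top_level_wire_delay_ns(wire_length_mm)
    if remaining <= 0:
        raise ValueError("the wire alone uses the whole clock period")
    return remaining * driver_share, remaining * (1.0 - driver_share)


if __name__ == "__main__":
    drv, rcv = block_budgets_ns(clock_period_ns=5.0, wire_length_mm=5.0)
    print(f"driver block: {drv:.2f} ns, receiver block: {rcv:.2f} ns")
```

If both sides of the boundary are flopped, the block shares shrink to
roughly clock-to-Q and setup, which is why that style leaves most of the
period for the top-level wire.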
I don't think there's a magic solution here that will save the day yet - the
best you can achieve with many of these tools is to get the amount of manual
repeater tweaks into the tens rather than hundreds/thousands.
BTW, what clock speeds and process geometry are we talking here?
- Lee Keep
3Com UK
---- ---- ---- ---- ---- ---- ----
From: [ A Synopsys FlexRoute AE ]
John,
I am a member of the Synopsys FlexRoute CAE team, and have been working on
top-level repeater/buffer insertion within FlexRoute for the last 6 months
or so. This capability just became available in our latest (Rev1.5)
FlexRoute release as of January 26, 2001. We have done extensive in-house
testing, and are confident in our algorithms, but I must admit that no
customer has used it on a production design yet.
FlexRoute is a gridless router, designed specifically as a top-level router
in a hierarchical system. We knew all along that repeater insertion was
critical, and have been working on it for some time.
There are two basic modes, timing-driven and length-based, each of which
I will describe briefly.
Timing-Driven Repeater Insertion
--------------------------------
1. Requires a TBEF (Timing Based Exchange Format) constraint file that
contains the following info for each top-level net:
a. driver cell name, and hierarchical RC tree representation from inside
the hierarchical block to the top-level pin of connection on that
block (usually on the edge of the block, but not required).
b. receiver cell name(s), and also hierarchical RC info as described
above, also includes a section to describe the arrival time budget
and required input slew rate.
Note: This is an ASCII format which can be easily generated with PERL
scripts etc. Future FlexRoute versions will derive this info
directly from a design .db and/or PrimeTime STAMP/ILM models.
2. Requires a .db timing library database for the standard cells to be
used for repeater insertion, and the driver and receiver cells.
3. The most useful option we have found from customer feedback is a rise
time (the same as a Max. Transition DRC check in DC/PC) optimization.
The user specifies a list of inverters and buffers which can be used,
and FlexRoute will insert the inverters or buffers as appropriate (will
not change signal polarity of course).
4. The end result is a set of legal, non-overlapping repeater locations, based on
defined DEF ROW/SITE locations. No placement legalization step is
required.
5. We have tested this on a variety of net types on large designs, including
200 pin reset/scan_enable type nets, which get reasonable solutions of
10-20 buffers, with all receiving pins meeting the rise time spec.
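For readers who want a feel for what a rise time driven pass has to decide,
the following Python sketch estimates how many equal segments a long net
must be cut into so that each segment meets a maximum transition limit. It
is not FlexRoute's algorithm and does not read TBEF or .db data. It assumes
a simple Elmore delay model of a distributed RC wire, the common single-pole
approximation (10-90% transition is about 2.2 times the time constant), and
made-up driver, wire and load values. All function and parameter names are
hypothetical.

```python
# Hypothetical sketch: how many segments does a net need to meet max transition?
# Model and numbers are assumptions for illustration only.


def segment_transition_ns(length_mm, r_drv_ohm, r_per_mm_ohm, c_per_mm_pf, c_load_pf):
    """Rough 10-90% transition at the far end of one driven wire segment.

    Elmore time constant of a driver plus distributed RC wire plus load,
    scaled by 2.2 (single-pole approximation). ohm * pF gives ps.
    """
    r_wire = r_per_mm_ohm * length_mm
    c_wire = c_per_mm_pf * length_mm
    tau_ps = r_drv_ohm * (c_wire + c_load_pf) + r_wire * (c_wire / 2.0 + c_load_pf)
    return 2.2 * tau_ps / 1000.0


def min_segments(total_mm, max_transition_ns, max_tries=1000, **rc):
    """Smallest number of equal segments whose transition meets the limit."""
    for n in range(1, max_tries + 1):
        if segment_transition_ns(total_mm / n, **rc) <= max_transition_ns:
            return n
    raise ValueError("limit cannot be met with this driver and load")


def repeaters_needed(n_segments, inverting):
    """Repeaters between segments; inverters are padded to an even count
    so that the signal polarity at the receiver is unchanged."""
    count = n_segments - 1
    if inverting and count % 2 == 1:
        count += 1
    return count


if __name__ == "__main__":
    rc = dict(r_drv_ohm=500.0, r_per_mm_ohm=100.0, c_per_mm_pf=0.2, c_load_pf=0.01)
    n = min_segments(total_mm=5.0, max_transition_ns=0.5, **rc)
    print(n, "segments,", repeaters_needed(n, inverting=False), "buffers")
```

A real tool would work on the actual routed topology and library delay
tables rather than equal segments and a lumped driver resistance.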
Length-Based Repeater Insertion
-------------------------------
1. All that is required is specification of a single inverter cell and
single buffer cell.
2. The design team must select these cells, and a specified "length" which
will meet their rise time or other timing goals.
3. This is obviously very fast, but shows good promise for top-level
repeater insertion.
4. The end result is the same as in Timing-Driven Repeater Insertion,
legal, non-overlapping repeater locations.
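The length-based idea can be sketched in a few lines of Python: walk the
routed path of a net, drop a repeater every fixed distance, and snap each
one to a free placement site. This is only an illustration of the concept
and not FlexRoute's implementation. It assumes a two-pin net given as a
Manhattan polyline, a uniform row/site grid starting at the origin, and
repeaters one site wide. The function names are hypothetical.

```python
# Hypothetical sketch of length-based repeater placement on a routed path.
# Grid, route and names are assumptions; blockages are not modelled.


def points_along_route(route, spacing):
    """Points every `spacing` units along a Manhattan polyline.

    route is a list of (x, y) vertices from driver to receiver. No point is
    produced at the receiver end itself.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    segments = list(zip(route, route[1:]))
    lengths = [abs(x1 - x0) + abs(y1 - y0) for (x0, y0), (x1, y1) in segments]
    total = sum(lengths)
    points = []
    target = spacing  # distance from the driver of the next repeater
    start = 0.0       # distance from the driver to the current segment start
    for ((x0, y0), (x1, y1)), seg_len in zip(segments, lengths):
        while target < total and target <= start + seg_len:
            t = (target - start) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            target += spacing
        start += seg_len
    return points


def snap_to_site(point, site_width, row_height, occupied):
    """Snap a point to the nearest site; slide right until the site is free."""
    col = round(point[0] / site_width)
    row = round(point[1] / row_height)
    while (col, row) in occupied:
        col += 1
    occupied.add((col, row))
    return (col * site_width, row * row_height)


if __name__ == "__main__":
    route = [(0, 0), (10, 0), (10, 10)]
    occupied = set()
    for p in points_along_route(route, spacing=6):
        print(snap_to_site(p, site_width=0.5, row_height=4.0, occupied=occupied))
```

Because the points are taken from the route itself rather than from a
straight-line or Steiner estimate, the repeaters stay close to where the
wire actually runs.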
One of the main advantages of both of these algorithms is that they are
based on FlexRoute coarse or detailed route net topologies, which fully take
into account all routing obstacles, as opposed to other techniques that may
use simplistic Steiner estimates for routes. The end result is buffer
locations that take into account routability, with stable and predictable
results that lead to timing closure.
- [ A Synopsys FlexRoute AE ]