Editor's Note: I wish to apologize for not getting ESNUG out last week.
I've had a very crappy 2 weeks. My Internet provider has had this very
ugly and subtle email bug that caused me to lose roughly 12% of the
email I received over the last 60 days. (I'll let you imagine what it's
like to lose a random 12% of two months' worth of emails on my business
life.) I paid $400 to have the brakes on my car fixed. There's still
something wrong with my car's brakes. I recently got a root canal. My
EMC stock that used to be worth $100.87 a share now trades for $13.49
a share. I just got news that some good friends got laid off. People
are freaked over catching anthrax from their mail; plus Sept. 11 ruined
not *1*, but *2* family get togethers we had planned.
And my birthday this year turned out to be (and, no, I'm not kidding)
on 'National Depression Screening Day'.
- John Cooley
the ESNUG guy
( ESNUG 380 Subjects ) ------------------------------------------ [10/25/01]
Item 1: Real Life Experiences w/ MoSys, Virage, & Artisan Memory Compilers
Item 2: Synchronicity, Clearcase, Cliosoft -- Which Is The Best RCS Today?
Item 3: Wall St. Curious About Numerical Tech, Inc. (vs Mentor vs Avanti)
Item 4: Verplex (Again) Tramples Chrysalis & Formality In User Benchmarks
Item 5: I'm Not Getting Quite The Scan Vectors I Wanted From TetraMax
Item 6: ( ESNUG 377 #18 ) PrimeTime & Its New Interface Logic Models (ILMs)
Item 7: ( ESNUG 379 #1 ) Yea, DC 2001.08 And Presto Core Dumped On Us, Too
Item 8: ( ESNUG 379 #7 ) The Cadence CTGEN vs. SPC Benchmark Was Pre-Route
Item 9: New Methodology Changes In The 2001.08 PhysOpt / DFT Compiler Flow
Item 10: ( ESNUG 379 #11 ) PrimeTime, Red Hat 7.0, Linux Benchmarks, ILMs
Item 11: ( ESNUG 378 #12 ) VCS Flags And Their Impact On Simulation Speed
Item 12: ( ESNUG 379 #9 ) PhysOpt/PrimeTime/VCS Sun/IBM/Linux Benchmarks
Item 13: ( ESNUG 379 #14 ) DC Behaves Badly Without Synch Set/Reset Flops
Item 14: We Got 24X Speed Up With The New DC/ACS 2001.08 Design Budgeter !
The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com
( ESNUG 380 Item 1 ) -------------------------------------------- [10/25/01]
From: [ Captin Krunch ]
Subject: Real Life Experiences w/ MoSys, Virage, & Artisan Memory Compilers
Hi John -
I noticed your DAC report was light in the IP memories area. Keep me anon
please, but here's some real-life information to fill people in:
1) The MoSys RAMs look like a good idea, and the people supporting them
are nice (though there aren't very many of them). However, they are
generally device people, not logic designers. Also -
a) The MoSys test suite works only if you can bring all the pins of the
RAM to the pins of your chip (and their nominal form factor is x128).
Otherwise you have to develop your own test patterns and methods of
telling the fuse burner what to do.
b) MoSys hasn't addressed the soft-error issue, so make sure you leave
plenty of room for (non-MoSys) RAMs for parity and ECC (especially
if your application is byte-wide). (It's a shame MoSys didn't
adjust their form-factors for this).
c) Fuses add to your processing and testing costs (and don't help with
soft-errors).
2) The Artisan memory compilers are pretty good. However, when Virage beat
Artisan to market for 0.15u last year, our company switched to Virage.
(Virage claimed better performance as well). Big Mistake! Virage
apparently bought its compiler technology at K-mart. Very immature
physicals, lots of bugs and problems. Artisan's compilers are much more
solid and mature. (And if Artisan and Virage are listening - can't you
guys agree on a common naming/byte-masking convention for your ports?)
I would love to hear what others have seen with Artisan, Virage, and MoSys.
- [ Captin Krunch ]
( ESNUG 380 Item 2 ) -------------------------------------------- [10/25/01]
From: Vijay Govindarajan <vijay.govindarajan@qstech.com>
Subject: Synchronicity, Clearcase, Cliosoft -- Which Is The Best RCS Today?
Hi, John
I'm interested in some revision control software. I searched DeepChip.com
site in detail on this issue. Synchronicity seems to have a lot of issues.
Clearcase seems hard to work with if you have not worked with it before.
I saw some articles on SOS from Cliosoft but these were really old postings.
Our company is looking for a tool that will support backend layout process
also. I'd like to know which version control tool is a good choice today.
- Vijay Govindarajan
Quicksilver Technology San Jose, CA
( ESNUG 380 Item 3 ) -------------------------------------------- [10/25/01]
From: Alex Woodward <woodward@mazamacap.com>
Subject: Wall St. Curious About Numerical Tech, Inc. (vs Mentor vs Avanti)
Hi, John,
Have you ever had letters dealing with resolution enhancement technologies
such as OPC and PSM? If so, how could I find them? I am particularly
interested in hearing what people have to say about Numerical Technologies
vs MENT vs AVNT's solutions. If you could point me in the right direction
on this, it would be helpful. Many of your ESNUG postings are better than
most analyst reports.
- Alex Woodward, Analyst
Mazama Capital Management Portland, OR
( ESNUG 380 Item 4 ) -------------------------------------------- [10/25/01]
From: [ Mr. Bigglesworth ]
Subject: Verplex (Again) Tramples Chrysalis & Formality In User Benchmarks
Hi John.
I thought I would share this data with you. Keep me anonymous on this.
We've been looking into a formal verification solution that can handle our
big chips. From a technical standpoint, it seems like a no-brainer.
Chip 1 (RTL2Gate) 2.5 M gates, hier design:
                               Time         Memory
                             --------     -----------
   Avanti Chrysalis 3.0       527 min      1310 Mbyte
   SNPS Formality 2000.11     No Data*     No Data*
   Verplex Tuxedo 2.0.8.a       6 min       302 Mbyte
      * - Formality had setup problems
Chip 2 (Gate2Gate) 4 M gates, flat design:
                               Time         Memory
                             --------     -----------
   Avanti Chrysalis 3.0      3596 min     14870 Mbyte
   SNPS Formality 2000.11     112 min      2399 Mbyte
   Verplex Tuxedo 2.0.8.a      89 min      2817 Mbyte
Chip 3 (RTL2Gate) 7.5 M gates, hier design:
                               Time         Memory
                             --------     -----------
   Avanti Chrysalis 3.0       255 min      1300 Mbyte
   SNPS Formality 2000.11     No Data*     No Data*
   Verplex Tuxedo 2.0.8.a      34 min       697 Mbyte
      * - Formality had setup problems
Chrysalis Design Verifyer was the most complicated to run. Formality
was also difficult to set up, but not as bad. For Tuxedo, within 30
minutes, we had everything up and running and comparing.
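
To put the three tables in ratio form, here is a small Python sketch. The numbers are copied from the tables; the dictionary layout and the function name are only illustrative, and the runs where Formality produced no data are skipped rather than guessed at.

```python
# Runtime (minutes) and memory (Mbyte) copied from the three tables above.
# None marks the runs where Formality produced no data.
results = {
    "Chip 1": {"Chrysalis": (527, 1310), "Formality": None, "Tuxedo": (6, 302)},
    "Chip 2": {"Chrysalis": (3596, 14870), "Formality": (112, 2399), "Tuxedo": (89, 2817)},
    "Chip 3": {"Chrysalis": (255, 1300), "Formality": None, "Tuxedo": (34, 697)},
}

def ratios_vs_tuxedo(chip):
    """Return {tool: (runtime ratio, memory ratio)} relative to Tuxedo."""
    base_time, base_mem = results[chip]["Tuxedo"]
    out = {}
    for tool, data in results[chip].items():
        if tool == "Tuxedo" or data is None:
            continue  # nothing to compare against itself or against missing data
        out[tool] = (data[0] / base_time, data[1] / base_mem)
    return out

for chip in results:
    for tool, (t, m) in ratios_vs_tuxedo(chip).items():
        print(f"{chip}: {tool} took {t:.1f}x the runtime and {m:.2f}x the memory of Tuxedo")
```

Note that on Chip 2 the memory ratio for Formality comes out below 1, since the table shows Formality using less memory than Tuxedo on that run.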
- [ Mr. Bigglesworth ]
( ESNUG 380 Item 5 ) -------------------------------------------- [10/25/01]
From: Don Dattani <dond@zucotto.com>
Subject: I'm Not Getting Quite The Scan Vectors I Wanted From TetraMax
Hi, John,
My question pertains to generating test patterns with TetraMax that can be
used in a gate level simulation. My design hierarchy consists of:
- a core (full-scan chain with MUXed flip-flops)
- a TAP controller and scan logic
- a bunch of other I/O
With TetraMax, I used
create_test_patterns -output core.vdb
write_test -input core.vdb -output corepatterns.v -format verilog -first 1
This generated a huge test that applies patterns at the core inputs, toggles
clocks, and checks outputs. What I want to be able to do is tell the tool
that I have a scan chain in the core so that the ATPG generated patterns
will be applicable to *BOTH*:
1) the core inputs (as applied at primary inputs)
- and -
2) the scan chain.
In simulation I imagined that this pattern would be loaded with the TAP,
then I'd toggle the clocks, scan out the patterns and validate the scan
chain and the primary outputs.
How do I get TetraMax to generate such a pattern?
- Don Dattani
Zucotto Wireless, Inc. Ottawa, Ontario, Canada
( ESNUG 380 Item 6 ) -------------------------------------------- [10/25/01]
Subject: ( ESNUG 377 #18 ) PrimeTime & Its New Interface Logic Models (ILMs)
> Has anyone used PrimeTime's new ILM capability? I'm interested in other
> people's experiences.
>
> - David Wang
> ATI
From: [ Not Dead Yet ]
Hi John,
Yeah, we have been using these ILM things for a while. For the most part,
ILMs are straightforward and easy to use. There was quite a bit of initial
confusion on how the various PrimeTime ILM options influence what logic is
kept/removed. We had to do our own matrix of experiments to figure that
out. We also have some concerns around how useful/accurate ILMs are if you
are trying to keep a handle on transition times.
Please keep me anonymous.
- [ Not Dead Yet ]
---- ---- ---- ---- ---- ---- ----
From: Ed Weber <ed_weber@agilent.com>
Hi, John,
We were using the ILM concept internally here years ago before we started
using PrimeTime. (We were HP back then.)
ILM's are a hierarchical design concept. When a child block's artwork is
complete, its timing is verified against its budgeted constraints using
extracted parasitics. When the child is timed in the context of a parent
block, the timing paths between internal registers of that child are checked
again. The idea of an ILM is to create a new netlist for a timing verified
child block which has all the internal register to register logic removed.
This netlist contains only the interface logic gates, hence the model name.
PrimeTime provides commands to write out the new netlist and a corresponding
SPEF file. Parent block timing and rebudgeting is then done using this
reduced model.
ILM's are absolutely accurate because they use the same gates and net
parasitics as the original netlist. Register to register paths within the
child block are however no longer there. So the parent must be run with the
same relevant clocks used in the child block timing verification.
Obviously an ILM model only works if there is internal register to register
logic to remove. A purely combinational child block will have the ILM
netlist identical to the original.
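
The reduction described above can be sketched on a toy netlist in Python. This is only an illustration of the idea, not PrimeTime's algorithm: the graph representation, the function name and the cell names are all made up, and things a real flow also has to keep (such as the clock network feeding the interface registers) are ignored.

```python
from collections import defaultdict

def interface_logic(edges, inputs, outputs, registers):
    """Return the cells a toy ILM-style reduction would keep.

    edges is an iterable of (driver, load) pairs between ports and cells.
    Tracing runs forward from the input ports and backward from the output
    ports, and stops at the first register reached in either direction.
    """
    fanout, fanin = defaultdict(set), defaultdict(set)
    for driver, load in edges:
        fanout[driver].add(load)
        fanin[load].add(driver)

    def trace(starts, graph):
        kept, stack = set(), list(starts)
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt in kept:
                    continue
                kept.add(nxt)
                if nxt not in registers:  # a register ends the trace
                    stack.append(nxt)
        return kept

    ports = set(inputs) | set(outputs)
    return (trace(inputs, fanout) | trace(outputs, fanin)) - ports

# in1 -> g1 -> r1 -> g2 -> r2 -> g3 -> r3 -> g4 -> out1
chain = ["in1", "g1", "r1", "g2", "r2", "g3", "r3", "g4", "out1"]
edges = list(zip(chain, chain[1:]))
kept = interface_logic(edges, ["in1"], ["out1"], {"r1", "r2", "r3"})
print(sorted(kept))  # g2, r2 and g3 (the register to register logic) are gone
```

Calling the same function on a block with no registers returns every cell, which matches the remark that a purely combinational child gains nothing from an ILM.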
ILM's have allowed us to stay on 32bit workstations with 2M+ gate designs
with reasonable run times of a few hours tops. Without them we would have
been forced into using ETM's which are quite painful to produce and debug.
We've had quite a positive experience with PrimeTime's flavor of ILMs here.
It pretty much works as advertised. I've attached some data below. There
are some annoying aspects of reading multiple hierarchical DSPF/SPEF files
into PrimeTime (which is required with ILM's), but it is tolerable.
Chip A is about 500 kgate. It has 19 top level instances of 3 unique
blocks. This chip was tape released without ILM's and parts met timing.
Timing analysis runs in 42 minutes. ILM's were made for the top blocks
after tape release to verify ILM's. Timing analysis with ILM's was about
4 minutes. Results were close to identical.
Chip B is about 2 million gates. With ILM's, it is down to about 500
kgates. Timing analysis runs about 2 hours. Timing without ILM's was
never done but estimated to be about 20 hours based on similar designs.
This chip has been tape released. Parts are not back yet.
Chip C is in development. It is estimated to have 1.8 million gates. It
has 11 top level children for which ILM's were created. It times in about
20 minutes with the ILM's. Timing one of the top level children takes
about 2 hours.
I hope this helps your readers, John.
- Ed Weber
Agilent Technologies Fort Collins, CO
---- ---- ---- ---- ---- ---- ----
From: Bruce Zahn <bzahn@agere.com>
John,
We have tried using Interface Logic Models (ILM) on a few designs. We used
the ILM parasitic flow with our internal delay calculator. Our basic flow:
1 Read in the block level design and constraints,
2 Generate ILM netlist, list of cells/nets/pins in ILM, ILM constraints
3 Run a Perl script to process our parasitics to match the ILM
4 Verify the ILM generated.
We use the parasitic flow because it is more accurate than using the SDF
flow, where the block level ILM SDF is written out to be used at the top
level. This ILM SDF will have hard delays for boundary cells that will not
vary based on the block input slopes and output loads. So unless the block
level slopes and loads specified in the block level constraints match what
is seen at the top level, the delays will not be accurate.
We use an internal parasitic format (similar to SPEF RNET).
We use a Perl script that reads in a list of nets/pins/cells in the ILM and
processes the parasitic file. This script removes any object that is not in
the ILM and does the proper 'escaping' for items in the ILM. Escaping is
needed since the ILM is a flat netlist (see below for more details). We
list nets/pins/cells in our ILM via a Tcl procedure using the get_ilm_object
command. This means long runtimes if the block is large.
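
The filtering and escaping step can be sketched in Python as below. Our parasitic format is internal, so the dictionary layout, the "/" hierarchy character, the backslash escape and the net names here are all stand-ins chosen for illustration, not the real script.

```python
def filter_and_escape(parasitics, ilm_nets, hier_char="/"):
    """Keep only nets that are in the ILM and escape the hierarchy character.

    parasitics: {net_name: data} for the full block (data is opaque here).
    ilm_nets:   set of net names reported as belonging to the ILM.
    """
    result = {}
    for net, data in parasitics.items():
        if net not in ilm_nets:
            continue  # object is not in the ILM, so drop it
        # The ILM netlist is flat, so a hierarchical name becomes one long
        # name in which the old separator has to be escaped.
        result[net.replace(hier_char, "\\" + hier_char)] = data
    return result

# Hypothetical full-block data: net name -> some per-net parasitic record.
full_block = {
    "u_core/u_alu/n12": "pi-model-A",
    "u_core/n7": "pi-model-B",
    "in_buf_n": "pi-model-C",
}
ilm_nets = {"u_core/u_alu/n12", "in_buf_n"}

for name, record in filter_and_escape(full_block, ilm_nets).items():
    print(name, record)
```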
Since the parasitic format used specifies a pi model for the entire net, our
delay calculations using ILMs will be correct even if the fanout differs in
the ILM as compared to the full block model. For example:
                                          output port
      FLOP1          U1                       _
      _____          |\                      | \
     |     |---------| |----------+----------|_/
     |     |         |/           |
     |     |                      |
      -----                       |
                                  |
      FLOP2          U2           |
      _____                       |
     |     |         /|-----------+
     |     |--------| |
     |     |         \|
      -----
When an ILM is generated, U2 will be removed since it is not in a timing
path to the output port. Therefore the pin capacitance of U2's input pin
will not exist in the ILM netlist. Since the parasitic format (similar
to SPEF RNET) specifies the pi model on the net, the capacitance of
U2's pin is incorporated in this pi model. Therefore the delay calculated
for U1 will see the correct load.
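
The bookkeeping behind this argument can be shown with a few lines of Python. All capacitance values and pin names below are invented for the figure above; the point is only that a per-net reduced model carries the load extracted on the full block, while summing the pins still present in the netlist loses U2's contribution.

```python
# Hypothetical numbers (pF) for the net driven by U1 in the figure above.
WIRE_CAP = 0.030
PIN_CAP = {"U2/A": 0.004, "out_port": 0.010}

def load_from_reduced_net_model(total_cap_of_net):
    """Reduced per-net model: the load is whatever was extracted for the full block."""
    return total_cap_of_net

def load_from_netlist_pins(pins_present):
    """Detailed-style bookkeeping: wire cap plus the pins that still exist in the netlist."""
    return WIRE_CAP + sum(PIN_CAP[p] for p in pins_present)

full_block_load = load_from_netlist_pins(["U2/A", "out_port"])

# The reduced model was extracted once on the full block, so it already
# contains U2's pin capacitance even after U2 is removed from the ILM.
ilm_load_reduced = load_from_reduced_net_model(full_block_load)

# With per-pin bookkeeping and U2 removed, the load seen by U1 is too small.
ilm_load_missing_u2 = load_from_netlist_pins(["out_port"])

print(f"full block:            {full_block_load:.3f} pF")
print(f"ILM, reduced model:    {ilm_load_reduced:.3f} pF")
print(f"ILM, pins summed:      {ilm_load_missing_u2:.3f} pF")
```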
If detailed parasitic format such as SPEF DNET is used then all the pins
of the ILM nets must be retained. Use the -include_all_net_pins option of
the write_ilm_netlist command. We have not looked into this flow at this
time. It may have an effect on the size of the ILM, especially for
propagated clock signals.
ILM constraints are generated to be used during ILM verification. (That's the
process of comparing the block level ILM IO timing paths vs the full block
level netlist IO timing paths). We have not used the constraints generated
with write_ilm_script -instance for the top level, because the top level
constraints are not usually generated in a manner that separates all the block
level constraints from top level.
Overall, we have seen good results generating ILM's for some designs. There
are some bugs/issues. Some designs are not suited for ILM's. Your blocks
must be synchronous, structured, and not too large. Here are some other issues
that we have run into:
- ILM is flat netlist - this can cause some problems!
Constraints
Any block level constraints that reference pins of a hierarchical block
will get modified when written out via write_ilm_script command. For
example, if a clock is defined on a hier pin (not good practice) it
will get pushed forward to the leaf level clock pins it drives. This
will cause the flight delay of the net to be lost.
Escaping
The parasitic file for the block needs to have any hierarchy character
escaped. We needed that Perl script that removed any object that is not
in the ILM and did the proper escaping for items in the ILM.
Even though we use an internal parasitic format, we looked at the SPEF
file written out from Design Compiler via write_ilm_parasitics that has
the hier characters escaped properly but has a bug where it escapes all
"[]" characters. SPEF has constructs to define bus delimiters, so
PrimeTime should only escape "[]" characters that are not busses.
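
A rule of the kind we would like to see can be sketched in Python as follows. This is an illustration only: it assumes the caller supplies the set of names that really are busses (a name alone cannot tell you that), and it uses a backslash as the escape character.

```python
import re

# A trailing "[<digits>]" is only a candidate bus index; the caller decides.
_BUS_BIT = re.compile(r"^(?P<base>.*)\[(?P<index>\d+)\]$")

def escape_brackets(name, bus_names):
    """Escape '[' and ']' except in the trailing index of a known bus."""
    def esc(text):
        return text.replace("[", "\\[").replace("]", "\\]")

    match = _BUS_BIT.match(name)
    if match and match.group("base") in bus_names:
        # Real bus bit: leave the index delimiters alone.
        return esc(match.group("base")) + "[" + match.group("index") + "]"
    return esc(name)  # brackets are just part of a flattened name

print(escape_brackets("data[3]", {"data"}))      # bus bit, index left as is
print(escape_brackets("cnt_reg[5]", set()))      # not a bus, brackets escaped
print(escape_brackets("gen[2].u_reg", set()))    # brackets in the middle of a name
```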
Verification
Checking that the ILM netlist preserved the correct logic can be a pain
since the model is flat. For example, some designs have very complex
clocking structures. It is very difficult to verify that the ILM
preserved these complex structures.
- Issues with complex cells such as dual scan flip flops and cells without
timing arcs.
We have seen some strange behavior when generating ILM netlists on designs
with complex cells such as dual scan flip flops. However, these can be
worked around by disabling some of the arcs in these flops.
- Cells without timing arcs (tiehigh/low, bushold) are troublesome.
ILM generation is based on tracing the timing arcs of cells. There is
some different behavior in different releases of PrimeTime concerning
cells without timing arcs such as tiehigh/low cells. In earlier releases
of PrimeTime these cells would not get written out in the ILM netlist,
but would be in the ILM constraints as set_case_analysis command. In
PrimeTime 2001.08 they are in the netlist and some set_case_analysis
commands exist to items that are not in the ILM netlist!!! It results
in bogus errors when the constraints are applied.
- Model generation runtime & update_timing issues
ILM generation is usually fast provided the tool does not perform an
update_timing. This is because PrimeTime has problems handling
structures with many driver/load combinations, and issues a warning
(PTE-038) that performance degradation will occur. Many of our designs
have these structures. So we must ensure that we do not cause an
update_timing prior to removing these structures. So we:
1 Read in netlist and minimal constraints ( clocks, etc)
2 Generate ILM netlist via identify_interface_logic,
write_ilm_netlist
3 Remove the structures PrimeTime dislikes
4 Apply remaining constraints and generate ILM constraints via
write_ilm_script
- ILM works best on structured blocks
Blocks must be very structured for ILM generation. All block IO must be
single fanin/fanout. This is to ensure that the reg-reg paths verified
at the block level are not affected by the top level. Stuff like scan
optimization may violate the single fanin/fanout rule.
If you trace reset or scan type ports, you'll get a large ILM since these
signals typically go to all flip flops in your block. To avoid this,
these ports are marked as 'ignored' during model generation via the
-ignore_port or -auto_ignore option to identify_interface_logic command.
As a consequence, the recovery checks of asynchronous paths can not be
verified. In 2001.08, a new option "-keep_ignored_fanout" has been
added to identify_interface_logic to preserve the logic network to
boundary registers only.
- verification & dangling pins in ILMs
PrimeTime 2001.08 has been beefed up to contain commands to help verify
ILMs vs full blocks. Most of the discrepancies we've seen between an
ILM and a full block are due to cells with multiple inputs where some of
the inputs in the ILM are dangling. This causes a different output slope
calculation since most delay calculators take the worst input slope when
calculating the output slope.
To help speed up ILM verification, we write out the slope and load of
all dangling pins in the ILM. These are then applied in our delay
calculation run.
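
Why a dangling input changes the answer can be seen in a toy Python model. The linear slope formula, its coefficients, the default slope for an undriven pin and the pin values are all invented for illustration; only the "worst input slope wins" rule comes from the description above.

```python
def output_slope(input_slopes, load, k_slope=0.5, k_load=2.0):
    """Toy delay-calculator rule: the worst (largest) input slope sets the output slope.

    The linear coefficients are made up for illustration only.
    """
    return k_slope * max(input_slopes.values()) + k_load * load

DEFAULT_SLOPE = 0.10  # hypothetical value assumed for an undriven pin
LOAD = 0.05

# Full block: pin B is driven by internal logic that the ILM later removes.
full_block = output_slope({"A": 0.20, "B": 0.60}, LOAD)

# ILM: pin B dangles, so the calculator falls back to its default slope.
ilm_dangling = output_slope({"A": 0.20, "B": DEFAULT_SLOPE}, LOAD)

# ILM with the recorded full-block slope annotated onto the dangling pin.
ilm_annotated = output_slope({"A": 0.20, "B": 0.60}, LOAD)

print(f"full block    : {full_block:.2f}")
print(f"ILM, dangling : {ilm_dangling:.2f}")
print(f"ILM, annotated: {ilm_annotated:.2f}")
```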
- Using ILMs in other EDA tools is difficult
In order to support other EDA tools, like our internal delay calculator,
the user may want a certain net/cell to be generated in the ILM netlist
even if it is not in the block IO timing paths.
I have requested an enhancement to have commands such as add_ilm_object
and remove_ilm_object. Commands like this would allow users to customize
the ILM model for uses in many other EDA tools. Synopsys has not agreed
to adding these commands.
- Floorplan Manager doesn't support ILMs either
We have not used ILM with Floorplan Manager. It seems like an excellent
match for ILM's. The main reason why we have not tried this is because
we have not had much success using Floorplan Manager (Layout Based
Optimization) to fix RC delay problems. Floorplan Manager seems to work
well for upsizing. We have had more success using physical tools to
address RC delay issues. PhysOpt is planning on supporting ILM flow.
Maybe this will help with timing closure.
Let me know if any of this is unclear. Some of the issues are hard to
explain without pictures. Overall I think ILMs are a good capability in
PrimeTime. However, ILM verification may be difficult if the block is not
structured well. We have one design with lots of clocks and many generated
clocks (and lots of MUXes); it would be very hard to verify that the
generated ILM understood the clock structure correctly.
I would be very interested in hearing other people's experience with this.
By the way, I would also be really interested in knowing if others are able
to run PrimeTime in a hierarchical mode using detailed parasitics. This
involves using read_parasitics -increment to read in each block level
dnet.spef file and then the top level. I have tried this and the tool gives
a load of false errors since it has not gotten the full RC network yet.
The report_annotated_parasitics command does not work well, so you have to
write Tcl scripts to figure out if all the parasitics are annotated.
The complete_net_parasitics command is also error prone and should not be a
global command.
I do not want to go on and on, but I am very interested in hierarchical
methodology. There was some of this at SNUG but it was very vague. It
also does not specify how many iterations and when the handoffs (block
level constraints, etc.) occur.
- Bruce Zahn
Agere Systems Allentown, PA
( ESNUG 380 Item 7 ) -------------------------------------------- [10/25/01]
Subject: ( ESNUG 379 #1 ) Yea, DC 2001.08 And Presto Core Dumped On Us, Too
> I want to do a rant about the new 2001.08 ver of DC. We've switched over
> to it and had nothing but problems. Tcl scripts which ran flawlessly on
> 2000.11-SP2 break for variables that should be set that aren't (such as
> "synopsys_program_name"), DesignWare license issues during initial linking
> and flattening after compiles, Presto issues, and illegal Verilog being
> written out. Ugh. I hate debugging new versions, and I think we'll wait
> to switch over when 2001.08-SP1 gets released in the very near future.
>
> - Gregg Lahti
> Corrent Corp. Tempe, AZ
From: John Pane <pane@std.teradyne.com>
John,
We experienced issues with the 2001.08 version as well. We were getting
some core dumps and such related to Presto. To get around this issue in the
short term we:
set hdlin_enable_presto false
in our .syn* setup file. This resolved our issue for now by making the
Presto stuff not run. Maybe that will help other users? Obviously if they
rely on Presto features, this won't be a good solution...
- John Pane
Teradyne
( ESNUG 380 Item 8 ) -------------------------------------------- [10/25/01]
Subject: ( ESNUG 379 #7 ) The Cadence CTGEN vs. SPC Benchmark Was Pre-Route
> I think that the slew & skew figures given in this benchmark are generated
> by the clocktree generation tools right after clock tree generation. ...
> In general, skew analysis done with a neutral STA tool like PrimeTime
> would be more acceptable. This is because, CTGEN and First Encounter may
> have different timing engines. ... If the person who did the benchmark can
> do such post-route analysis also, the benchmark would be more realistic.
>
> - Donepudi Narasayya
> STMicroelectronics
From: [ Rinse, Lather, Repeat ]
Please keep me anonymous again on the CTGEN comments.
I was the one who originally started this thread and I admit, the numbers
were pre-route. I can say that in the case of CTGEN the skew actually got
worse after final routing and extraction. We don't have a router in house
so I don't know if I will be able to provide final First Encounter numbers
unless we do purchase the tool. However, SPC does use estimated routes that
are better than what other tools seem to be using (ie: not Steiner routes),
so I think it's likely SPC's skew numbers are not far off from reality,
especially given what I have heard others say about this tool and in
particular, the clock tree generation capabilities.
On CTGEN again, I gave it a problem where a large macro effectively split a
clock domain into 2 chunks of flops, one on the side of the macro and one on
the bottom. I don't think CTGEN considers obstructions at all so when it
tried to generate a clock tree on this domain it went haywire and came up
with a skew of 25 ns and over 10,000 buffers. I ended up getting good
results by splitting the 2 chunks of flops into 2 trees, generating the
trees for each chunk, and then unifying them with a simple "top" tree. This
gave good results but since I saw similar problems on other blocks with
macros I came to the conclusion that CTGEN is in no way suited to SOC types
of problems as it does not consider obstructions or routing or seemingly
anything in any decent way.
I don't see how anyone doing large chips can seriously consider continuing
to use CTGEN unless they like heavily assisted flows.
- [ Rinse, Lather, Repeat ]
( ESNUG 380 Item 9 ) -------------------------------------------- [10/25/01]
From: Vandana Kaul <vkaul@synopsys.com>
Subject: New Methodology Changes In The 2001.08 PhysOpt / DFT Compiler Flow
Hi John,
I want to alert your readers to the recent changes in the DFT Compiler /
PhysOpt flow in the 2001.08 release.
The Old PhysOpt Scan Flow
-------------------------
In previous releases of PhysOpt, once a scan replaced, physically optimized
design was created, additional steps were required to 1) legalize any new
cells added (e.g. lockup latch) during scan chain stitching; and 2) fix
any timing violations introduced during scan chain stitching. This is no
longer necessary in 2001.08. Here are the old RTL-to-placed-gates (RTL2PG)
and the old gates-to-placed-gates (G2PG) flows, respectively:
Old (Pre-2001.08) RTL2PG Flow:
#Generate placed, optimized, scan replaced netlist
compile_physical -scan
check_dft
#Stitch scan chain
set_scan_configuration ...
set_scan_signal ...
insert_scan -physical
check_dft
#Verify all cells are placed on legal locations
check_legality
#Legalize the placement of any new cells
legalize_placement -eco
#Fix timing violations
physopt -incremental
Old (Pre-2001.08) G2PG Flow:
#Generate scan replaced netlist
compile -scan
#Place & optimize scan replaced netlist
physopt
check_dft
#Stitch scan chain
set_scan_configuration...
set_scan_signal...
insert_scan -physical
check_dft
#Verify all cells are placed on legal locations
check_legality
#Legalize the placement of any new cells
legalize_placement -eco
#Fix timing violations
physopt -incremental
Notice the placement legalization step required after scan chain stitching
and the incremental "physopt" run required to fix any timing violations.
The New PhysOpt Scan Flow
-------------------------
In v2001.08 of PhysOpt, the command "insert_dft" automatically performs any
necessary placement legalization and timing optimization, thereby removing
the need for the additional commands mentioned above. Following are the
RTL-to-placed gates and gates-to-placed gates flows in 2001.08:
New 2001.08 RTL2PG Flow:
#Generate placed, optimized, scan replaced netlist
compile_physical -scan
check_dft
#Stitch scan chain
set_scan_configuration...
set_scan_signal...
insert_dft -physical
check_dft
New 2001.08 G2PG Flow:
#Generate scan replaced netlist
compile -scan
#Place & optimize scan replaced netlist
physopt
check_dft
#Stitch scan chain
set_scan_configuration...
set_scan_signal...
insert_dft -physical
check_dft
The benefits of using the new 2001.08 flow are: 1) new cells added after
scan insertion are automatically placed during "insert_dft -physical";
2) timing violations are automatically fixed during "insert_dft -physical."
The New "set_scan_state" Command
--------------------------------
When starting with an existing Verilog netlist that has already been scan
replaced, the new command, "set_scan_state" should be used to tell PhysOpt
and DFT Compiler the netlist is test ready. Following is an example flow:
#Read scan replaced netlist & floorplan
read_verilog scan.v
read_pdef floorplan.pdef
#Place & optimize scan replaced netlist
physopt
#Indicate design is scan replaced
set_scan_state test_ready
#Report/confirm scan state of design
report_test -state
#Stitch scan chain
set_scan_configuration...
check_dft
insert_dft -physical
check_dft
When starting with a design that has the scan chains stitched, use the
command "set_scan_state scan_existing" instead of "set_scan_state
test_ready" to indicate the design is scan chain stitched, not just scan
replaced.
The command "report_test -state" can also be used when starting with an
existing scan replaced .db file to verify the scan state of the design.
All Scan Is Now Placement Driven
--------------------------------
In previous releases of PhysOpt, ordering of scan flip-flops within a chain
was placement driven, but partitioning of scan chains was alphanumeric, not
placement driven. In 2001.08, PhysOpt / DFT Compiler will partition scan
cells into scan chains based upon placement information for the scan chains.
The benefit is additional reduction in wire length, and therefore improved
congestion and routability.
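
The effect of the partitioning choice on wire length can be seen in a toy Python example. This is not the PhysOpt / DFT Compiler algorithm: the flop names, coordinates, the greedy nearest-neighbour stitching and the "sort by position, then split" partitioner are all assumptions made to keep the sketch short.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def chain_length(points):
    """Manhattan length of one chain stitched in greedy nearest-neighbour order."""
    remaining = sorted(points)
    if not remaining:
        return 0
    current, total = remaining.pop(0), 0
    while remaining:
        nxt = min(remaining, key=lambda p: manhattan(current, p))
        total += manhattan(current, nxt)
        remaining.remove(nxt)
        current = nxt
    return total

def split(names, n_chains):
    """Cut an ordered list of flop names into n_chains consecutive groups."""
    size = -(-len(names) // n_chains)  # ceiling division
    return [names[i:i + size] for i in range(0, len(names), size)]

def total_length(placement, chains):
    return sum(chain_length([placement[n] for n in chain]) for chain in chains)

# Eight hypothetical flops: even-numbered ones on the left edge of the block,
# odd-numbered ones 100 units to the right.
placement = {f"ff{i}": (0 if i % 2 == 0 else 100, 2 * i) for i in range(8)}

by_name = split(sorted(placement), 2)                      # alphanumeric partition
by_place = split(sorted(placement, key=placement.get), 2)  # position-based partition

print("alphanumeric partition :", total_length(placement, by_name))
print("placement partition    :", total_length(placement, by_place))
```

In this made-up layout both alphanumeric chains have to cross the block, while the position-based chains each stay within one column of flops.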
Adding A Lockup Latch To The End Of Scan Chains
-----------------------------------------------
The "set_scan_configuration" command has a new option in 2001.08 which
allows customers to add a lockup latch to the end of the scan chain. The
command syntax to enable this capability is:
"set_scan_configuration -insert_end_of_chain_lockup_latch true"
The default value for this option is false.
I hope your readers find this information useful!
- Vandana Kaul
Synopsys, Inc. Mountain View, CA
( ESNUG 380 Item 10 ) ------------------------------------------- [10/25/01]
Subject: ( ESNUG 379 #11 ) PrimeTime, Red Hat 7.0, Linux Benchmarks, ILMs
> Basically, I'm looking to get some feedback from anyone who has experience
> of running PrimeTime on a Red Hat 7.0 machine.
>
> - Bob Flynn
> Massana
From: [ The Cowardly Lion ]
John,
Anonymous please.
I started with PrimeTime 2000.11 on a Pentium-III running Red Hat 6.2 in
March. I don't have all the machine details, but I had a 400 MHz Sun
running Solaris 7.0, and an 800 MHz P-3 running Red Hat 6.2. The runtime
almost always followed the CPU speed ratio (i.e. 2X MHz = 2X PrimeTime
performance.) I quickly moved to the Linux version given that it halved
my run time. And my company's execs were much happier about buying
several more P-3's and P-4's as we grew, rather than shelling out for
more Suns.
I taped out a chip earlier this year on similar machines upgraded to Red
Hat 7.1. Presently, I'm running PrimeTime 2001.08 on P-III's and P-4's.
Overall, I don't have any complaints, other than the PrimeTime GUI
is *still* not working. If anyone has it running, I'd appreciate any tips.
If anyone has any experience with ILM, introduced in 2000.11, I'd be very
interested. I used it with some success in the pre-tapeout stages, but
there were bugs that left me without the confidence to use it for tapeout,
so I did a full flat netlist for final sign-off.
Synopsys gave me patches for PrimeTime 2000.11 to get ILMs working, and the
problem is supposed to be fixed in 2001.08. We're about to test that out.
If anyone else has any good or bad experiences with ILMs, or tips they can
share, that would be great.
Finally, has anyone seen any unique issues running PrimeTime on a 64-bit Sun
platform? We're about to bring one up in our compute center. Anyone
considering Itanium and Linux?
- [ The Cowardly Lion ]
---- ---- ---- ---- ---- ---- ----
From: "Russ Petersen" <russp@subasic.sciatl.com>
John,
My benchmarks for Linux earlier today included running PrimeTime 2001.08 on
Red Hat 7.1. It seemed to complete with no problems, although I did not check
the results in detail. I just looked to see that PrimeTime completed without
issues and that all the reports had been written.
- Russ Petersen
Scientific Atlanta
( ESNUG 380 Item 11 ) ------------------------------------------- [10/25/01]
Subject: ( ESNUG 378 #12 ) VCS Flags And Their Impact On Simulation Speed
> After being away from using it for four years, the VCS documentation
> hasn't changed much if at all. Finding out all of the new switches
> added or the performance adds to VCS isn't possible: the shipped
> docs still have Chronologic plastered all over them and there is a
> serious lack of any documentation available on the Solvnet site. (I
> had to resort to going through my SNUG handouts of previous years to
> find out about the latest and great issues with VCS).
>
> - Gregg Lahti
> Corrent Corp Tempe, AZ
From: Mark Warren <mwarren@synopsys.com>
Hi, John,
Gregg Lahti's letter got me concerned that many VCS customers might not be
running VCS as fast as they could. VCS does not have nearly as many switches
as DC, but it is very important to understand the effect of switches on VCS'
performance. Please post this reply to show your readers a quick overview
about maximizing VCS performance.
Over the years, VCS has added many optimizations aimed at increasing the
performance of all types of Verilog code. At compile-time, VCS must make
most decisions about how to perform these optimizations. In default behavior
(no switches), VCS will perform some optimizations, but it's important to
understand that some switches can turn on much more aggressive optimizations,
while other switches turn on debugging or PLI visibility (which hinders VCS
speed optimizations). Your choice of switches can often make a big difference
(3X+) to the performance that you see!
VCS flags for better performance:
   +rad                 performs aggressive optimization for RTL or
                        (non-timing) gate level. A *lot* of effort has
                        gone into these Radiant optimizations. This is
                        the single best way to get the largest speedups
                        in VCS.
   +nospecify           ignore path delays and timing checks for a
                        functional gate sim
   +notimingcheck       just ignore timing checks
   +timopt              perform aggressive timing optimizations (for SDF
                        gate sims)
   -Mupdate             use incremental compile to only recompile modules
                        that functionally changed
   +nocelldefinepli+2   do not dump or use PLI access from within library
                        cells
   +vcsd                use more efficient Direct Kernel Interface for
                        VirSim dumping
   +2state              convert entire design (except certain constructs)
                        to 2 states for more speed/capacity
VCS debug/PLI flags that hurt speed/memory performance:
   +cli         globally turn on interactive debug
   -line        allow line stepping in the interactive debugger
   +acc         obsolete flag to allow global PLI access
   -I           obsolete flag for interactive GUI debug
   -RI          compile and run interactive debug with VirSim GUI
   -PP          allow VirSim VCD+ binary dumping for post-processing debug
   -X*          version specific flags to work around specific bugs
   -P pli.tab   where pli.tab contains acc=rw,cbk:* (globally turns on
                PLI access)
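One way to put the two lists above to work is a quick audit of the arguments a regression script passes to VCS. The following is only a minimal sketch in Python: the flag names are copied from the lists above (leaving out the timing-check flag), while the function name, the exact-match rule, and the treatment of -P are assumptions for illustration, not anything VCS itself provides.

```python
# Flag names come from the two lists above; everything else here
# (function name, matching rules) is a hypothetical illustration.
SPEED_FLAGS = {"+rad", "+nospecify", "+timopt", "-Mupdate",
               "+nocelldefinepli+2", "+vcsd", "+2state"}
DEBUG_FLAGS = {"+cli", "-line", "+acc", "-I", "-RI", "-PP"}


def audit_vcs_args(args):
    """Split an argument list into (speed flags seen, debug/PLI flags seen).

    Only exact matches are recognized, except that anything starting
    with -X is treated as a version specific workaround flag.
    """
    speed, debug = [], []
    for arg in args:
        if arg in SPEED_FLAGS:
            speed.append(arg)
        elif arg in DEBUG_FLAGS or arg.startswith("-X"):
            debug.append(arg)
        elif arg == "-P":
            # The pli.tab may or may not contain acc=rw,cbk:*, so this
            # sketch just reports -P and leaves the file for manual review.
            debug.append(arg)
    return speed, debug
```

For example, audit_vcs_args(["+rad", "-Mupdate", "-PP", "top.v"]) returns (["+rad", "-Mupdate"], ["-PP"]), which would be a hint that VCD+ dumping is still enabled in a run that is otherwise set up for speed.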
Other tips/warnings on VCS performance:
Use the latest release of VCS. VCS 6.0 was released in January 2001 and it
contains many gate-level and RTL performance updates.
Occasionally using aggressive optimizations like +rad will expose race
conditions which cause simulation mismatches. Race conditions should be
avoided as much as possible. VCS comes with both dynamic and static race
utilities (+race and +race=all), plus Synopsys offers LEDA linting and rules
checkers that can help find the ambiguous coding styles that cause races.
If you don't want to use LEDA, use VeriLint, or HDLlint, SureCov, or any of
a dozen other commercial linters. The important thing is to detect and
remove those race conditions! Often +rad can give a 3X speedup for many
designs, so it is worth the effort to debug races in order to utilize +rad
for all your regressions.
VCS also comes with a profiler (+prof) that should be used periodically to
see if any blocks of your design are hogging up too much CPU time. The
output of the profiler is easy to read.
- Mark Warren
Synopsys, Inc. Cupertino, CA
( ESNUG 380 Item 12 ) ------------------------------------------- [10/25/01]
Subject: ( ESNUG 379 #9 ) PhysOpt/PrimeTime/VCS Sun/IBM/Linux Benchmarks
> The following table lists the values reported back from the Unix "time"
> command on a couple of small synthesis jobs.
>
>     job    cpu Sun     cpu Linux PC    ratio
>      1     3059 sec    1715 sec        1.8x
>      2      652         347            1.9x
>
> Seems to track the MHz scale fairly nicely.
>
> - Scott Evans
> Sonics Inc. Mountain View, CA
From: "Shannon Hill" <shannon_hill@tenornetworks.com>
John,
Here's my contribution to the growing number of Linux/Sun/IBM synthesis
benchmarks. Task: synthesize top-level design in IBM cu11, 8,669,551 cells,
121,179 instances
   usertime    systime     elapsed   cpu             mem/mhz
   --------------------------------------------------------------
   19757.0u     83.0 sec   5:33:44   Sun Ultra60     2gb/333mhz
   15514.5u    335.5 sec   4:30:54   IBM rs6000      4gb/400mhz
    9833.5u     35.2 sec   2:44:32   Intel P3-1000   1gb/1000mhz *
* - we used the Abit vp6 motherboard
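To compare these runs with the ratios quoted in the earlier benchmark, the elapsed times can be converted to seconds and divided. This is a minimal Python sketch; the elapsed values are copied from the table above, and the function and dictionary names are made up for illustration.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' elapsed time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed wall-clock times from the table above.
ELAPSED = {
    "Sun Ultra60": "5:33:44",
    "IBM rs6000": "4:30:54",
    "Intel P3-1000": "2:44:32",
}


def speedups(elapsed, baseline):
    """Return each machine's speedup relative to the baseline machine."""
    base = to_seconds(elapsed[baseline])
    return {name: base / to_seconds(t) for name, t in elapsed.items()}


if __name__ == "__main__":
    for name, ratio in speedups(ELAPSED, "Sun Ultra60").items():
        print(f"{name:14s} {ratio:.2f}x")
```

With the Sun Ultra60 as the baseline, the IBM comes out at roughly 1.2x and the P3-1000 at roughly 2.0x on elapsed time.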
Interestingly enough, SUN wants $39,000 for 4 gb of memory on their new
systems, while 4 gb of ECC Registered DDR can be had for around $2000
(about $500 for a 1gb DIMM).
- Shannon Hill
Tenor Networks, Inc. Acton, MA
---- ---- ---- ---- ---- ---- ----
From: "Simon Matthews" <simon@paxonet.com>
Hi, John,
We have been running PrimeTime extensively on a 1M gate design. We had
been using our Sun Ultra-60s (450 MHz processor, 2 GB RAM) or our AX-MP
machine (400 MHz processor with 8 MB cache, 4 GB RAM).
The jobs typically took about 30-40 minutes on the workstations.
We recently tested the job on Linux and won't look back.
The Linux box used was a P-III/800 MHz, 1 GB RAM (PC100). The jobs take
around 16 minutes on this machine.
- Simon Matthews
Paxonet Communications Fremont, CA
---- ---- ---- ---- ---- ---- ----
From: "Russ Petersen" <russp@subasic.sciatl.com>
Hi John,
We are considering the purchase of some Dell dual P-III 1.26 Tualatin
rackmount systems with 4 Gigs of SDRAM and we recently benchmarked them to
get some idea of their performance on real world Synopsys jobs. I thought
others would find this interesting. Here are my 4 basic tests:
test A - 400,000 gate PhysOpt 2001.08 gates-to-gates
(Memory usage: 1.3 Gig or so max.)
test B - 1.6 million gate PrimeTime 2001.08 job using DSPF's,
(uses about 1.2 Gigs)
test C - VCS 6.0 R20 simulation of RTL source code
(140 megs)
test C2 - lengthened VCS 6.0 R20 simulation of test C
(140 megs)
Machines:
Sun Ultra - 450 Mhz UltraSparc2 with 4 Gigs Ram, 4 Processors
AMD - AMD Dual Athlon 1.2 Ghz MP's with 2 Gigs DDR SDRAM
DELL - 1.26 GHZ Dual PIII Tualatins with 4 Gigs SDRAM
On all tests I was careful not to exceed the 2 gigs of DDR on the Athlon
box as I did not want swap to play a part.
   Machine        Tests Running           Run Time
   -----------    --------------------    --------------------------
   Sun Ultra      test A                  17 hours, 38 mins
   AMD Athlon     test A                  8 hours
   Dell P-III     test A                  7 hours, 31 mins

   Sun Ultra      test B                  14 hours, 7 minutes
   AMD Athlon     test B                  6.25 hours
   Dell P-III     test B                  5.52 hours

   Sun Ultra      test C                  not tested
   AMD Athlon     test C                  1.5 hours
   Dell P-III     test C                  1.26 hours

   Sun Ultra      test A + test B         test A: 17 hours, 42 mins
                                          test B: 14 hours, 11 mins
   AMD Athlon     test A + test B         not tested since combined
                                          memory exceeds 2 gigs limit
   Dell P-III     test A + test B         test A: 7.68 hours
                                          test B: 5.65 hours

   AMD Athlon     test A + test C2        test A: 9.58 hours - *
                                          test C2: 8.43 hours - *
   Dell P-III     test A + test C2        test A: 9.1 hours - *
                                          test C2: 7.67 hours - *

     * - "test A + test C2" used about 1.45 gigs for the 2 tests
         so swap did not come into play

   Dell P-III     test C2                 6.15 hours
   Sun Ultra      test C2                 6 hours, 56 minutes
   Dell P-III     test C2 no file dump    4.47 hours
   Sun Ultra      test C2 no file dump    5 hours 4 minutes
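Since the run times above mix "hours, mins" and decimal-hour notation, a small helper makes the single-job Sun vs. Dell comparison easier to read. This is a minimal Python sketch; the run time strings are copied from the table, and the helper names are hypothetical.

```python
import re


def to_hours(text):
    """Parse '17 hours, 38 mins', '5 hours 4 minutes' or '5.52 hours'."""
    hours = re.search(r"(\d+(?:\.\d+)?)\s*hours?", text)
    mins = re.search(r"(\d+)\s*min", text)
    total = float(hours.group(1)) if hours else 0.0
    if mins:
        total += int(mins.group(1)) / 60.0
    return total


# Single-job runs from the table above: (Sun Ultra, Dell P-III).
SINGLE_JOB = {
    "test A": ("17 hours, 38 mins", "7 hours, 31 mins"),
    "test B": ("14 hours, 7 minutes", "5.52 hours"),
    "test C2": ("6 hours, 56 minutes", "6.15 hours"),
}


def sun_over_dell(runs):
    """Return the Sun run time divided by the Dell run time, per test."""
    return {test: to_hours(sun) / to_hours(dell)
            for test, (sun, dell) in runs.items()}


if __name__ == "__main__":
    for test, ratio in sun_over_dell(SINGLE_JOB).items():
        print(f"{test:8s} Dell is {ratio:.2f}x faster than the Sun")
```

On these numbers the Dell is a bit over 2.3x faster on the PhysOpt run and about 2.6x faster on the PrimeTime run, but only about 1.1x faster on the VCS run (test C2).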
I was surprised to see that the dual Athlon did not beat out the dual P-III
with SDRAM system. I conclude that this is due to the performance of the
P-III Tualatin procs. (See Tom's Hardware for details on this CPU; 512k
caches on them.) I just thought on dual process jobs the DDR would really
buy a lot but I guess it doesn't on these types of jobs. I assume that as
the processor speed increases the extra bandwidth of DDR would help more.
I was also surprised the one simulation job I got to run (test C2) to
compare the Sun Ultra versus the P-III did not show a greater spread. I
have seen other simulation jobs previous to this run faster on a Linux box
than on a Sun workstation, so I assume it must be something funny about
this particular sim. (I would be curious to hear others' results with
simulations on Linux vs. Sun workstations.)
Also, on the VCS runs, I included the compile time so maybe it's the compile
time that differed a lot. I don't know as I have not had a chance to check
this out.
Amazing. I would say you could easily get the Dell with the 4 Gigs for
about $5-6k, depending on your company relationship with them and quantity.
The AMD box would probably be (I'm guessing since 1 Gig DDR dimms were not
yet available) around $7-8k with 3.5 Gigs of DDR. The AMD had the
disadvantage of only allowing 3.5 Gigs max, as it has only 4 memory slots and
one of them for some reason won't handle a 1 Gig DIMM once those become
available.
Sun - I just priced a SunBlade 1000 with one 900 Mhz CPU and 4 Gigs of
memory on Sun's store website and it came to a grand total of $59,040.00.
An additional 900 Mhz CPU increases the price by about $10k. The main
advantage this machine would have over the X86 boxes is a big, big cache
(8 Megs per CPU) and big memory bandwidth. (I think the total bandwidth is
something like 3.8 Gigs per second). So I would guess it would cost at
least $40k for the Sun equivalent of my Linux boxes (ouch!), since the
prices on most companies' web sites tend to run really high.
Overall, one thing is clear, the Linux boxes performed very well, especially
given their price, and hence I want to order a good number for our use
here. It also helps that most Synopsys tools are out on Linux now. I just
hope Synopsys can get a 64-bit port of PrimeTime to Linux soon.
- Russ Petersen
Scientific Atlanta
( ESNUG 380 Item 13 ) ------------------------------------------- [10/25/01]
Subject: ( ESNUG 379 #14 ) DC Behaves Badly Without Synch Set/Reset Flops
> We currently use synchronous resets in our ASIC designs. We are finding
> a lot of time is spent meeting timing on synchronous resets, especially
> at clock speeds 200+ MHz. We are considering switching to asynchronous
> resets so that the logic connected to the D pin is not impacted by reset.
>
> Can someone list the pros and cons of sync vs. async resets? Are there
> RTL versus gate simulation issues? What are the PrimeTime and test
> issues? How do you avoid them?
>
> - Siegfried Weidelich
> McDATA Corporation
From: Jon Harris <jharris@siroyan.com>
Hi, John,
We have recently taped-out a design with synchronous resets, which were
favoured over asynch resets mainly to decrease circuit susceptibility to
possible noise on the reset lines. However I did encounter one problem
in DC during the flow development:
My preferred flow was to use commands like set_driving_cell, set_ideal_net,
set_dont_touch_network, etc., on my design's reset port and then insert
a reset tree in the layout. However, what I found was a problem in DC's
elaboration stage due to the fact that some of the design's flops needed to
be RESET to logic low and some needed to be SET to logic high.
To describe the problem more fully, consider an active low reset signal,
rst_n and a block with 100 flops. 51 flops are to be SET when rst_n is low
and 49 are to be RESET when rst_n is low. The UMC 0.15 technology we were
using had no synchronously set/reset flops in the library, and so DC had to
create the reset logic by placing a gating element just before the 'D' input
of each flop.
For the 49 flops to be RESET when rst_n goes low, this gating is simply
achieved with an AND gate, and for those 51 flops to be SET when rst_n
goes low, an OR gate is used with the rst_n input inverted. It was this
latter case which caused the problem, because DC would elaborate the required
OR gating but would do it by inverting rst_n ONCE, and then fanning this out
to all 51 flops with OR gates. Similarly a non-inverted rst_n would be
fanned-out to the 49 flops with AND gates.
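The gating described above can be modeled in a few lines to make the inverter counts concrete. This is only a Python sketch of the logic, not DC output: the function names are hypothetical, and the per-flop option models the kind of inverter duplication being asked for rather than any existing DC switch.

```python
from itertools import product


def reset_gate(d, rst_n):
    """D input of a flop that is RESET when rst_n is low: AND gating."""
    return d & rst_n


def set_gate(d, rst_n):
    """D input of a flop that is SET when rst_n is low: OR with inverted rst_n."""
    return d | (rst_n ^ 1)


def inverter_count(n_set_flops, per_flop):
    """Inverters needed on rst_n for the flops that are SET.

    per_flop=False models one shared inverter fanning out to every OR gate;
    per_flop=True models one inverter next to each OR gate.
    """
    if n_set_flops == 0:
        return 0
    return n_set_flops if per_flop else 1


if __name__ == "__main__":
    for d, rst_n in product((0, 1), repeat=2):
        print(d, rst_n, reset_gate(d, rst_n), set_gate(d, rst_n))
    print(inverter_count(51, per_flop=False), inverter_count(51, per_flop=True))
```

For the 100 flop block in the example, the shared style needs 1 inverter driving a second 51-load tree, while the per-flop style needs 51 inverters but leaves rst_n as a single tree reaching all 100 gating elements.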
This approach meant for each block being synthesized there were effectively
2 reset trees, one with an invertor as its driver. Consequently it was not
possible to simply apply the "set_ideal_net" type commands and expect DC to
not put buffers in the inverted reset tree, as the invertor protected the
inverted-tree from such constraints.
So, I was stuck with all these invertors, which were needed for the
functionality but which got in the way when trying to constrain the trees.
In the end it WAS possible with some fiddling around using commands like
clean_buffer_tree to strip out all reset buffers, but I still had the problem
that the layout tool had to cope with a few dozen invertors midway down the
reset network. What I really wanted was one tree fanning out to every flop,
each flop with its own gating element and invertor (if required).
I do not believe I would have seen this problem if UMC had had synchronously
set/reset flops in their library, but given that they did not (like other
vendor libraries that I know of) I think there is a requirement here for some
kind of "invertor proliferate" switch so that in elaboration every flop to be
SET on rst_n low gets its own invertor along with the gating element.
Although this does introduce extra logic and therefore a little extra area,
it would save a lot of time and effort on the reset tree, allowing it to be
effortlessly put in by either synthesis or layout tools without any rogue
invertors stuck in the middle.
- Jon Harris
Siroyan Reading, Berkshire, UK
( ESNUG 380 Item 14 ) ------------------------------------------- [10/25/01]
From: Lars Rzymianowicz <larsrzy@ti.uni-mannheim.de>
Subject: We Got 24X Speed Up With The New DC/ACS 2001.08 Design Budgeter !
Hi John,
I just installed the new 2001.08 release and found out that the new Design
Budgeter is significantly faster than PrimeTime. We're using ACS for our
current ASIC, and prior to 2001.08, DC called PrimeTime for generation of
gate-level budgets. That took 12 hours for our design -- longer than the
whole compile of the design. I saw some other notes about long runtimes of
PrimeTime budgeting.
With 2001.08, DC has a built-in budgeter, which seems to be much faster. It
took only 30 min for our design. That's an impressive 24X speed-up! To
enable it, you have to set a var:
set acs_use_dc_gate_level_budgeting "true"
With this set to true, DC will use its own budgeter for gate-level budgeting,
which is done for acs_recompile_design or acs_refine_design commands in ACS.
- Lars Rzymianowicz
University of Mannheim Mannheim, Germany
============================================================================
Trying to figure out a Synopsys bug? Want to hear how 11,000+ other users
dealt with it? Then join the E-Mail Synopsys Users Group (ESNUG)!
!!! "It's not a BUG, jcooley@world.std.com
/o o\ / it's a FEATURE!" (508) 429-4357
( > )
\ - / - John Cooley, EDA & ASIC Design Consultant in Synopsys,
_] [_ Verilog, VHDL and numerous Design Methodologies.
Holliston Poor Farm, P.O. Box 6222, Holliston, MA 01746-6222
Legal Disclaimer: "As always, anything said here is only opinion."
The complete, searchable ESNUG Archive Site is at http://www.DeepChip.com