( ESNUG 393 Item 11 ) -------------------------------------------- [04/25/02]
Subject: ( ESNUG 388 #19 ) My 31% Speed-up By Hand-Tweaking DW Arithmetic
> Here's a simple datapath example and different timing/area results with
> different flows.
>
> module mux4 (m0,m1,x,b0,b1,z);
>
> parameter n=32;
>
> input [n-1:0] m0,m1,x;
> input [2*n-1:0] b0,b1;
> output [n:0] z ;
>
> wire [2*n:0] y1;
> wire [2*n:0] y0;
> wire [2*n:0] y2;
>
> assign y0 = m0*x + b0;
> assign y1 = m1*x + b1;
>
> assign y2 = (y1>y0) ? y1 : y0;
> assign z = y2[2*n:n] + y1[(n-1):0];
>
> endmodule
>
>
> All of these have been achieved using TSMC's 0.13 um technology and the
> 2001.08-SP2 release of DC.
>
> Flow                        Path Length  Path Slack  Design Area  Compile Time
> --------------------------  -----------  ----------  -----------  ------------
> DC-Expert + DW_Standard           13.74       -7.24    240760.28       2745.29
> DC-Expert + DW                     7.33       -0.83    213677.06       3161.75
> DC-Ultra + DW + MCI + TCSA         6.63       -0.13    210615.31       2754.35
> DC-Ultra + DW + MCI + PD           6.50        0.00    174409.92        952.84
>
> DW_Standard is the standard library shipped with DC. DW is the full
> DesignWare library. "DW + MCI + TCSA" means DesignWare, with
> dw_prefer_mc_inside set to true and the transform_csa command. "DW + MCI
> + PD" means DesignWare, with dw_prefer_mc_inside set to true and the
> partition_dp command.
>
> - Oliver Meisel
> Synopsys, Inc. Mountain View, CA
From: [ Papa Smurf ]

John, anon pls.

I have quite a bit of experience building arithmetic units for graphics
chips, mainframe processors and DSPs. I ran Oliver Meisel's example using
a "DC-Ultra + DW + MCI + TCSA" flow and found that it could meet 6.0 nsec
using a similar library which also targets TSMC's 0.13 micron process.
I set max_fanout to 20, ungrouped everything, and set the operating
conditions to worst-case military. I used -map_effort high and max_area 0.
I'm using 2001.08-SP1. (Be sure to use SP1 or later if you're using the
transform_csa command, as *bad* logic can result otherwise.)

I then synthesized a version which instantiated hand-optimized multipliers
and adders. These results appear as "RTL-1" in the chart below. This design
met 5.5 nsec and was smaller than the DW implementations.

To further improve performance, I re-architected the code, duplicating an
adder so that the magnitude compare and the last addition are performed in
parallel. These results are listed as "RTL-2" in my chart.

I went from:

    assign y2 = (y1>y0) ? y1 : y0;
    assign z  = y2[2*n:n] + y1[(n-1):0];

to:

    assign y3 = y0[2*n:n] + y1[(n-1):0];
    assign y4 = y1[2*n:n] + y1[(n-1):0];
    assign z  = (y1>y0) ? y4 : y3;

This design (also using my hand-optimized multiply-accumulators and
adders) met 4.5 nsec.

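For reference, here is a minimal sketch of the reorganized "RTL-2" code as a complete module. The module name mux4_rtl2 and the declared widths of y3 and y4 are assumptions made for illustration (the snippet above does not declare them), and plain * and + operators stand in for the hand-optimized arithmetic, which is not shown in this post.

```verilog
// Sketch of the "RTL-2" reorganization as a self-contained module.
// Assumptions: the name mux4_rtl2 and the [n:0] widths of y3/y4 are
// illustrative; they are chosen to match the width of output z.
module mux4_rtl2 (m0, m1, x, b0, b1, z);

  parameter n = 32;

  input  [n-1:0]   m0, m1, x;
  input  [2*n-1:0] b0, b1;
  output [n:0]     z;

  wire [2*n:0] y0;
  wire [2*n:0] y1;
  wire [n:0]   y3;
  wire [n:0]   y4;

  assign y0 = m0*x + b0;
  assign y1 = m1*x + b1;

  // Both candidate sums are formed without waiting for the compare.
  assign y3 = y0[2*n:n] + y1[(n-1):0];   // result if y0 is selected
  assign y4 = y1[2*n:n] + y1[(n-1):0];   // result if y1 is selected

  // The magnitude compare now only drives the final 2:1 select.
  assign z = (y1>y0) ? y4 : y3;

endmodule
```

Because y3 and y4 are sized like z, each sum is truncated to n+1 bits exactly as in the original expression, so the two versions should produce the same z. The cost is the duplicated adder; the benefit is that the compare and the final addition are no longer in series.
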
Flow                                         Path Length     Area
-------------------------------------------  -----------  -------
DC-Ultra + DW + MCI + TCSA (original RTL)         6.0 ns  150,805
DC-Expert "RTL-1" (hand optimized arith)          5.5 ns  118,097
DC-Ultra "RTL-2" DW + MCI + TCSA (re-arch)        5.0 ns  157,753
DC-Expert "RTL-2" (re-arch) (hand op arith)       4.5 ns  129,329

Note that the big gains come from architectural changes. The transform_csa
command is a powerful architectural tool that saves significant area and
delay. My hand-optimized multipliers and adders also use a carry-save
architecture, and they saved about 0.5 ns and significant area over the DW
implementation, but reorganizing the code had an even larger impact (about
1.0 ns in either flow). Overall I went from 6.0 nsec on Oliver's original
RTL down to 4.5 nsec. That's a 25% reduction in path length, or about 31%
measured against the 6.50 nsec Oliver reported for his best flow.
All of these runs met the path length timing constraint listed.

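To illustrate what "carry-save" means here, the following is a generic textbook sketch of a 3:2 carry-save stage followed by a single carry-propagate add. It is not the hand-optimized arithmetic described above, nor the logic that transform_csa generates; the module name csa_add3 and the parameter w are hypothetical.

```verilog
// Generic 3-operand add using one carry-save (3:2) stage.
// Assumption: csa_add3 and parameter w are illustrative names only.
module csa_add3 (a, b, c, z);

  parameter w = 32;

  input  [w-1:0] a, b, c;
  output [w+1:0] z;          // w+2 bits hold the full sum of three operands

  // Per-bit full adders: no carry ripples between bit positions.
  wire [w-1:0] s  = a ^ b ^ c;                    // sum vector
  wire [w-1:0] cy = (a & b) | (a & c) | (b & c);  // carry vector

  // a + b + c == s + 2*cy, so only this one add propagates carries.
  assign z = {2'b00, s} + {1'b0, cy, 1'b0};

endmodule
```

The point of the structure is that the delay of the 3:2 stage is one full-adder cell regardless of w, so chaining several such stages and finishing with one carry-propagate adder is typically faster than chaining full adders with carry propagation in each.
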
- [ Papa Smurf ]