( ESNUG 483 Item 8 ) -------------------------------------------- [11/19/09]

From: [ Unfrozen Caveman Lawyer ]
Subject: One user's 5 week eval of CatapultC synth vs. hand coded RTL

Hi, John,

Anonymous please.  I wanted to share the results of my company's 5 week
CatapultC synthesis evaluation with your readers.  This was my company's
first use of Catapult C.  Here's what I did plus the time estimates:

  - I selected a test case design block that was a portion of our 45 nm
    modem design that I could redesign in C++ and synthesize with CatC.
    It took me about a week to select this design block (and I am not
    including this time in the project time of course.)  This modem block
    consisted of several sub-blocks which performed processing operations
    such as fractional delay, rotation, decimation, filtering, pilot
    weighting and 16 different modes of saturation/quantization.

  - Our company has its own internal ASIC libraries, and our library group
    needed to characterize them for CatC synthesis.  It took them 4 days.
    Catapult C's library builder was used in the characterization process.
    This was an important step because the quality of Catapult C's code
    depends on the quality of the characterization.

  - I installed Microsoft Visual C++ to write, debug and verify the
    functionality of this modem block.

  - Writing the C++ code itself took only 2 days.  (FYI, I had some previous
    experience with Catapult before working on this project, though not at
    this company.)  The total time was a bit longer because I spent some
    time thinking about implementing a flexible architecture instead of
    simply replicating the exact architecture.  This took about 2-3 weeks.

  - I verified the design as part of the Catapult flow.  I could do a more
    thorough verification because C++ simulates faster than RTL.  Also,
    Catapult C has its own formal verification flow that ensures that the
    generated RTL matches the behavior of the C++ code.

Results:

Input/Output lines of code:

                 Catapult C input:  350 lines C++ code
                Catapult C output:  1330 lines Verilog RTL code

Area comparison:

          Original hand coded RTL:  0.17 sq mm
                  Catapult design:  0.12 sq mm  (-29%)

Project time:

 Hand-coding RTL and verification:  4 weeks

  Using Catapult C for arch opto,
           C synthesis to RTL and
                     verification:  2.5 weeks  (-40%)

The above 29% design area reduction was not a mystery.  Our
original modem block involved a filtering operation followed by decimation.
It was not optimized for the decimation process by using a polyphase filter.
I wanted to implement the exact same design with Catapult, so I didn't
implement a polyphase filter explicitly.  However, I had enough flexibility
in my C++ implementation that I ended up with the equivalent of a polyphase
filter without ever needing to code one explicitly.
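
For readers unfamiliar with the filter-then-decimate point, here is a
minimal sketch of the general idea.  It is not the code from this eval.
The names (fir_then_decimate, decimating_fir), the integer data and the
3-tap filter are all assumptions made up for illustration.  It only shows
that computing just the outputs that survive decimation gives the same
result as filtering at the full rate and throwing samples away, which is
the saving a polyphase structure exploits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: run the FIR at the full input rate, then keep every M-th output.
std::vector<int> fir_then_decimate(const std::vector<int>& x,
                                   const std::vector<int>& h,
                                   std::size_t M) {
    assert(M > 0);
    std::vector<int> full(x.size(), 0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            full[n] += h[k] * x[n - k];
    std::vector<int> y;
    for (std::size_t n = 0; n < full.size(); n += M)
        y.push_back(full[n]);
    return y;
}

// Decimating FIR: only compute the outputs that survive decimation,
// so roughly 1/M of the multiply-accumulates are needed.
std::vector<int> decimating_fir(const std::vector<int>& x,
                                const std::vector<int>& h,
                                std::size_t M) {
    assert(M > 0);
    std::vector<int> y;
    for (std::size_t n = 0; n < x.size(); n += M) {
        int acc = 0;
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            acc += h[k] * x[n - k];
        y.push_back(acc);
    }
    return y;
}

// Hypothetical usage: both forms agree on a small made-up input.
void decimation_example() {
    std::vector<int> x = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<int> h = {1, -2, 3};
    assert(fir_then_decimate(x, h, 2) == decimating_fir(x, h, 2));
}
```

Whether an HLS tool arrives at the same structure depends on how the loops
are written and constrained, so treat this as an illustration only.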


Mentor's Complex datatypes:

I used Mentor's AC_complex class library to model my complex (real and
imaginary) signals with fixed point bit accuracy.  This class library
simplified the syntax for complex multiplication, conjugation, saturation
and quantization.  Without it, if I wanted to do complex multiplication, I
would have to either:

   1. Do it manually myself, e.g. write out 1 complex multiply as the
      equivalent 4 real multiplies and 2 real additions

Or

   2. Define my own class and datatype for these complex operations.

Everything I needed was in Mentor's complex class libraries, which were
free, so I chose it.
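
To make option 1 concrete, here is a rough sketch of the manual expansion.
It does not use Mentor's class library.  The Cplx struct and the cmul and
conjugate helpers are hypothetical stand-ins using plain ints, with no
fixed-point bit widths, saturation or quantization.

```cpp
#include <cassert>

// Hypothetical stand-in for a complex value; plain ints, no bit accuracy.
struct Cplx {
    int re;
    int im;
};

// (a + jb)(c + jd) = (ac - bd) + j(ad + bc)
// That is 4 real multiplies plus 1 subtract and 1 add.
Cplx cmul(Cplx x, Cplx y) {
    int ac = x.re * y.re;
    int bd = x.im * y.im;
    int ad = x.re * y.im;
    int bc = x.im * y.re;
    return Cplx{ac - bd, ad + bc};
}

// Complex conjugate: negate the imaginary part.
Cplx conjugate(Cplx x) {
    return Cplx{x.re, -x.im};
}

// Hypothetical usage: (1 + 2j)(3 + 4j) = -5 + 10j
void cmul_example() {
    Cplx p = cmul(Cplx{1, 2}, Cplx{3, 4});
    assert(p.re == -5 && p.im == 10);
}
```

A class library hides this behind an overloaded operator, which is the
syntax simplification described above.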


Parallelism and HW scaling:

Our source code was written with two levels of parallelism.  One was
hierarchical; the other was that the design could process 8 samples per
clock cycle.  I experimented with scaling the hardware by adjusting the
clock frequency while maintaining the same overall throughput (the
equivalent of 8 samples per clock cycle at 150 MHz).

The initial design was running at 150 MHz.  I replicated the hardware to get
a throughput higher than 1 sample/cycle.  When I implemented it in C++, I
could use a 300 MHz clock and replicate the hardware only 4 times instead
of 8 times.  (The neat thing here is that if you code it properly in C++, it
is done for 'free' with the push of a button and a change of a parameter.
In RTL, by comparison, this would be a complete redesign.)  We replicated
the exact HW and I wrote the C++ code with enough flexibility to explore
alternative architectures.  I tried 600 MHz also.  You can see the results
below.

   Frequency   # of HW instances   Area (sq mm)   Area Reduction

   150 MHz     8 times             0.120            -
   300 MHz     4 times             0.073           39%
   600 MHz     2 times             0.052           57%

Note that the area reduction is not completely linear due to pipelining.
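
As a rough sketch of the "change of a parameter" idea and of the arithmetic
behind the table, assume the replication count is a C++ template parameter.
ParallelBlock, throughput_msps and area_reduction_pct are hypothetical
names, and the per-sample operation is a placeholder, not the modem
datapath.  The checks only confirm that all three table rows give the same
samples/second and that the percentages follow from the listed areas.

```cpp
#include <cassert>

// Hypothetical lane-parallel block: LANES samples are processed per clock.
template <int LANES>
struct ParallelBlock {
    static_assert(LANES > 0, "need at least one lane");
    // Placeholder per-sample operation (a gain of 3), not the real datapath.
    void process(const int (&in)[LANES], int (&out)[LANES]) const {
        for (int i = 0; i < LANES; ++i)  // one iteration per replicated lane
            out[i] = 3 * in[i];
    }
};

// Throughput in Msamples/s for a given clock (MHz) and lane count.
constexpr int throughput_msps(int clock_mhz, int lanes) {
    return clock_mhz * lanes;
}

// All three configurations in the table sustain the same sample rate.
static_assert(throughput_msps(150, 8) == 1200, "150 MHz x 8");
static_assert(throughput_msps(300, 4) == 1200, "300 MHz x 4");
static_assert(throughput_msps(600, 2) == 1200, "600 MHz x 2");

// Area reduction versus a baseline, in percent, rounded to nearest.
constexpr int area_reduction_pct(double base, double area) {
    return static_cast<int>((1.0 - area / base) * 100.0 + 0.5);
}

// Hypothetical usage: switching from 8 lanes to 4 is a one-token change.
void scaling_example() {
    ParallelBlock<4> blk;
    int in[4] = {1, 2, 3, 4};
    int out[4] = {0, 0, 0, 0};
    blk.process(in, out);
    assert(out[0] == 3 && out[3] == 12);
}
```

Halving the lane count does not halve the area in the table, which is
consistent with the note above about pipelining.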


Where Catapult C needs work:

I found a bug in Catapult C's GUI.  Mentor has been informed of the bug.
When you adjust the clock frequency, the clock period doesn't automatically
get updated.  The workaround is that there is a different location to enter
the clock frequency where it does update the clock period properly.

Also I want to see more complete documentation that tells you how to do
certain types of designs and what coding style to use.  In their training
class, Mentor has examples such as filters, transforms, error correcting
code, but I would like to see a written reference manual.

Timing "hierarchy" was confusing to learn at first.  Catapult C has a system
or "hierarchical" mode.  Hierarchical mode means you can run different
blocks in parallel.  C++ code inherently runs sequentially (e.g. line 1,
line 2, line 3), but hardware is concurrent.  CatapultC automatically
extracts this concurrency from your sequential C++ code.  If you want to run
a block in parallel with another block, you have to declare it as a separate
function.  You can assign each block to a "hierarchy" through CatapultC's
GUI or you can put a comment (pragma) in the code.  We did it both ways.
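
A minimal sketch of the coding style this implies, assuming two made-up
stages (stage_scale and stage_offset) and a made-up block size N.  The
tool-specific pragma is deliberately left as a comment placeholder because
its exact spelling is not given here.  In plain C++ the two calls run one
after the other; the point is only that each stage is its own function, so
a tool has a boundary it can turn into a separate block.

```cpp
#include <cassert>

const int N = 4;  // hypothetical block size

// Candidate block 1 (a tool-specific hierarchy pragma would go here).
void stage_scale(const int in[N], int mid[N]) {
    for (int i = 0; i < N; ++i)
        mid[i] = 2 * in[i];
}

// Candidate block 2 (a tool-specific hierarchy pragma would go here).
void stage_offset(const int mid[N], int out[N]) {
    for (int i = 0; i < N; ++i)
        out[i] = mid[i] + 1;
}

// Top level: sequential in C++, but the two calls only communicate
// through 'mid', so they can be mapped to separate hardware blocks.
void top(const int in[N], int out[N]) {
    int mid[N];
    stage_scale(in, mid);
    stage_offset(mid, out);
}

// Hypothetical usage: the C++ model gives out[i] = 2 * in[i] + 1.
void hierarchy_example() {
    int in[N] = {1, 2, 3, 4};
    int out[N] = {0, 0, 0, 0};
    top(in, out);
    assert(out[0] == 3 && out[3] == 9);
}
```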


As for designing in C++ vs RTL, I found that over time you develop a feel
for what the generated RTL will be based on your C++ code.  You also get
confident that you will get a 1-to-1 mapping that you can verify against
a Gantt chart.  In my case, I did filtering and simple LUT implementation
which I was familiar with, and I could easily see the tight coupling
between my source code and CatapultC's results.  I think using their
AC_complex class library saved some time, too.

    - [ Unfrozen Caveman Lawyer ]