( ESNUG 482 Item 6 ) -------------------------------------------- [06/30/09]

From: Ningyi Xu <xu.ningyi=user domain=microsoft hot calm>
Subject: Wow! Even Microsoft uses AutoESL's C synthesis to speed up its SW

Hi, John,

We purchased AutoESL's AutoPilot in 2008 to move some of the time-consuming
cores of our software into FPGA hardware to speed up runtime.  We found
this can often accelerate our SW runtimes by 2-3 orders of magnitude.
AutoESL claims its C-to-RTL synthesis tool supports both Altera and Xilinx
FPGAs, as well as ASICs, but we only tried it on Altera Stratix IIs.  The
software we accelerated:

  1. RankBoost - a machine-learning algorithm used in the dynamic ranking
     of search engines.  RankBoost is several thousand lines of ANSI C,
     with the synthesizable time-consuming core being 149 lines.

     We used AutoPilot to generate RankBoost's core computation logic and
     integrate it with existing interface IP cores like the DDR2 controller
     IP core.  AutoESL utilized common megafunctions for the target devices
     and automatically generated the Avalon bus interface.  The final
     implementation used about 12000 ALUTs on the Altera Stratix II FPGA.

  2. Sorting Algorithm - also several thousand lines of OO C++ code with 138
     lines that needed speeding up.  AutoESL again utilized megafunctions
     for our Stratix IIs.  Additionally, we used AutoPilot's "APInt", or
     arbitrary precision integers.  AutoPilot has a source-level simulation
     utility for APInt.  Resource usage depends on the size of the
     processing engine array.  For example, when the sorting processing
     engine array size is 128, synthesis reports a total of 11346 ALUTs.

In general, AutoPilot takes a high level description of a design in ANSI C,
OO C++ or SystemC as input, and synthesizes it into Verilog/VHDL RTL code.
AutoPilot also automatically generates a vector-based testbench from the
C/C++ level testbench, which users can run to verify the design.


ANSI C vs. OO C++ vs. SystemC

AutoPilot supported all the ANSI C and C++ language constructs we needed
to implement our algorithms in hardware.  Our standard C/C++ function
parameters were synthesized into various handshaking, memory, streaming,
and bus interfaces.  We didn't test all the language features that AutoESL
claimed to support, but I believe our two cases covered the most commonly
used language features that we may use as input to AutoPilot.

For our RankBoost, we used ANSI C.  In the Sorting Algorithm, we used C++.

  1. We wrote our RankBoost design in ANSI-C. (C is simple and compact).

  2. We wanted to implement the Sorting Algorithm using an object oriented
     style code; since ANSI-C is not object oriented, we used C++ for it.
     We wrote the Sorting Algorithm to take advantage of a couple of C++
     features, including classes and templates, so that the code itself
     would be more generic and reusable.  For example, our data elements
     types could be easily configured with template parameters.

  3. We used AutoESL's APInt data type (arbitrary precision integer data
     type) for the Sorting Algorithm.  APInt is supported in both C and C++,
     but using APInt in C++ with templates was easier, since AutoESL's C++
     APInt is also a templatized class.

We never tested the object-oriented C++ code in AutoESL; we had committed to
one particular Sorting Algorithm (odd-even sort) with fixed data type for
the implementation, and OO was not a must-have for this purpose.
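
For readers unfamiliar with it, here is a minimal plain-C sketch of odd-even
transposition sort.  That is one common reading of "odd-even sort" and is an
assumption here, since this post does not show the actual source.  The names
odd_even_sort, compare_exchange and is_sorted, and the int element type, are
illustrative only.  Each phase compares disjoint neighbor pairs, so all
compare-exchanges within a phase are independent, which is why the algorithm
maps naturally onto an array of processing elements.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange: roughly what one processing element would do. */
static void compare_exchange(int *a, int *b)
{
    if (*a > *b) {
        int t = *a;
        *a = *b;
        *b = t;
    }
}

/* Odd-even transposition sort, ascending, in place.
 * n phases are enough to sort n elements.  Even phases handle pairs
 * (0,1), (2,3), ...; odd phases handle pairs (1,2), (3,4), ...
 * Pairs inside one phase never overlap, so the inner loop carries no
 * dependence from one iteration to the next. */
void odd_even_sort(int *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t phase = 0; phase < n; ++phase) {
        for (size_t i = phase % 2; i + 1 < n; i += 2) {
            compare_exchange(&data[i], &data[i + 1]);
        }
    }
}

/* Checking helper: returns 1 if data[0..n) is in ascending order. */
int is_sorted(const int *data, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        if (data[i - 1] > data[i]) return 0;
    }
    return 1;
}
```

Calling odd_even_sort(v, 128) on a 128-element array would match the array
size quoted earlier, though how AutoPilot mapped such loops onto hardware is
not described in this post.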


Design exploration:

AutoPilot's fast runtime allowed us to do in-depth explorations of the
design space.  For RankBoost's core computation logic, we investigated
different performance/area tradeoffs while quickly retargeting one ANSI C
source to 2 different FPGA tech libraries on an Altera Stratix II:

        Design     Mem     FP      Reg    Logic   Latency
        and Lib   (bits)  adder    FFs    ALUTs   cycles    MHz
       ---------  ------  -----   ----    -----   -------  ------
       AutoPilot   128K     8     7911    5886     19M     140.55
       (8 PEs, 
       XtremeData
       floating
       point lib)

       AutoPilot   128K     8     6295    5295     19M     107.03
       (8 PEs
       Altera FPU
       lib)

       AutoPilot   144K    12     9999    9706     14M     105.49 
       (12 PEs, 
       Altera FPU
       lib)

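Hand-generated code vs. AutoPilot generated code:

        Design     Mem     FP      Reg    Logic   Latency
                  (bits)  adder    FFs    ALUTs   cycles    MHz
       ---------  ------  -----   ----    -----   -------  ------
       Hand-coded  128K     8     5373    5523     19M     125.00
       RTL

       AutoPilot   128K     9     5453    5316     19M     125.00
       (final
       design)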

AutoPilot's RTL code generation time for the core SW in RankBoost was only
about 1.5 minutes -- near-zero compared to our time to hand-code RTL.
Because of AutoPilot's fast synthesis time, we did additional design space
exploration to select the best configuration.  We were able to get a QoR
comparable to hand-coded RTL, yet we still saw a 75% project time savings.

        Manual RTL creation time, including verification: 2 months
     AutoPilot RTL creation time, including verification: 2 weeks

The above time to create RTL with AutoPilot included 5 major revisions of
our C code for RankBoost.  We had cropped the initial code from RankBoost's
software implementation, and found the original coding style could be more
efficiently written for C synthesis implementation and optimization.  We
had two kinds of modifications on RankBoost:

  1. Modifying the ANSI C code for better C synthesis.  For example, the
     major body of our code was initially written in the main() function.
     For synthesis, we moved the code into a separate function called from
     main(), with this new function specified as the top module to be
     synthesized.

     We also made changes to the parameters of the function and assigned
     an interface type to each input and output, as the following shows:

           void foo(float *mem_data,
                    volatile uint64 *input_dataport1,
                    volatile float *input_dataport2,
                    int size,
                    volatile float *output)
           {
           #pragma AUTOPILOT INTERFACE fifo port=input_dataport1
           #pragma AUTOPILOT INTERFACE fifo port=input_dataport2
           #pragma AUTOPILOT INTERFACE fifo port=output

           //major body of the code here

           }

     Note: The "volatile" pointer type is needed to specify a FIFO.  If a
     pointer is marked as volatile, the compiler preserves the number and
     order of its read and write accesses instead of optimizing them.

  2. Modifying C code for improving code optimization.  For example, in our
     initial code, we had

                      for (j = 0; j < 255; ++j)
                        {
                          k = 255 - j;
                          fHisto[k - 1] += fHisto[k];
                        }

     This piece of code was used to build an integral histogram from a
     256-bin histogram.  We had thousands of histograms to be processed.
     Each histogram is stored in an array declared as

                      float fHisto [256];

     Since the floating point adder in Altera's megafunction library needs
     7 to 8 cycles to output the result and there is a read-after-write
     dependency, the addition operation could not be fully pipelined in the
     above code.  To remove bubbles in the pipeline, we put 16 histograms
     together:
                      float fHisto [16][256];

     And then processed them in an interleaved manner:

                      for (j = 0; j < 255; ++j)
                        {
                          for (i = 0; i < 16; ++i)
                            {
                              #pragma AUTOPILOT pipeline II=1
                              k = 255 - j;
                              fHisto[i][k - 1] += fHisto[i][k];
                            }
                        }

     Notice that we used a pragma to specify the loop pipelining interval.
     To boost data-level parallelism, we implemented 8 more pipelines since
     the histograms are independent of each other:

            float fHisto[8][16][256];
            for (j = 0; j < 255; ++j)
              {
                for (i = 0; i < 16; ++i)
                  {
                    #pragma AUTOPILOT pipeline II=1
                    for (k = 0; k < 8; ++k)
                      {
                      #pragma AUTOPILOT unroll
                      fHisto[k][i][255 - j - 1] += fHisto[k][i][255 - j];
                      }
                  }
              }

     This code was then synthesized to an 8-way SIMD (Single Instruction,
     Multiple Data) engine.  Through these code changes, we avoided the
     bubbles in the RankBoost pipeline, reduced the latency, and fully
     utilized data parallelism with an 8-way SIMD architecture.

     I would like to mention that we could easily change it to a 16-way SIMD
     by simply adding and modifying a few lines in the RankBoost C code.
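
A quick way to convince yourself that the restructuring above does not change
the results is to run both loop forms in plain C and compare them.  The
sketch below is only a software check, not our actual testbench: the pragmas
are left out, and the function names, the test data, and the count_mismatches
helper are assumptions made for illustration.

```c
#include <assert.h>

#define BINS  256   /* bins per histogram, as in the post */
#define GROUP 16    /* histograms interleaved per pipeline, as in the post */
#define LANES 8     /* unrolled pipelines (SIMD ways), as in the post */

/* Original loop: turn one histogram into its integral (suffix-sum) form. */
void integral_reference(float h[BINS])
{
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;
        h[k - 1] += h[k];
    }
}

/* Restructured loop nest with the AUTOPILOT pragmas omitted.
 * Consecutive i iterations touch different histograms, so dependent
 * additions on any one histogram are GROUP iterations apart. */
void integral_interleaved(float h[LANES][GROUP][BINS])
{
    for (int j = 0; j < BINS - 1; ++j)
        for (int i = 0; i < GROUP; ++i)
            for (int k = 0; k < LANES; ++k)
                h[k][i][BINS - 1 - j - 1] += h[k][i][BINS - 1 - j];
}

/* Hypothetical test data: small integer counts, so every partial sum is
 * exactly representable in a float. */
static float sample_count(int lane, int hist, int bin)
{
    assert(bin >= 0 && bin < BINS);
    return (float)((lane * 31 + hist * 17 + bin * 7) % 13);
}

/* Runs both versions on the same data; returns the number of bins that
 * differ between them. */
int count_mismatches(void)
{
    static float batch[LANES][GROUP][BINS];
    static float single[BINS];
    int mismatches = 0;

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int b = 0; b < BINS; ++b)
                batch[k][i][b] = sample_count(k, i, b);

    integral_interleaved(batch);

    for (int k = 0; k < LANES; ++k) {
        for (int i = 0; i < GROUP; ++i) {
            for (int b = 0; b < BINS; ++b)
                single[b] = sample_count(k, i, b);
            integral_reference(single);
            for (int b = 0; b < BINS; ++b)
                if (single[b] != batch[k][i][b])
                    ++mismatches;
        }
    }
    return mismatches;
}
```

Each histogram sees the same additions in the same order in both versions,
so count_mismatches() should return 0.  Only the interleaving across
histograms changes, and that is what gives the scheduler room to pipeline.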

On our Sorting Algorithm, the generated logic from AutoPilot was so close to
our theoretically optimal results that we saw no reason to implement it
manually for comparison purposes.  We just used AutoPilot's RTL.  So I don't
have hand-coded RTL vs. AutoPilot RTL data for the Sorting Algorithm.


The set-up and learning curve for AutoPilot:

It took us less than 1 day to set up the AutoPilot environment for the first
time, and only several minutes for the follow-on designs.

In the early stages, our design methodology was an iterative loop between
constraining AutoPilot synthesis and results analysis with its built-in
Control Data Flow Graph (CDFG).  Later, we started with the targeted micro
architecture in mind and then we created the C/C++ code plus corresponding
synthesis directives.  So it was important to our implementation to be
familiar with AutoPilot's directives.  Here's our ramp-up for AutoESL:

  - 1 to 2 days for onsite training on AutoPilot: basics, methodologies,
    tool setup, hands-on tutorials.

  - 1 to 2 weeks to begin with your own design and learn by doing.  In
    our case, we did this with our RankBoost project.

  - 3 to 4 additional weeks to try out AutoPilot's other advanced features
    like: simulation, integration with SoPC, customized IP, floating point,
    advanced language optimization, etc.  This process may take some time;
    I prefer a "learning by doing" style here too, because some advanced
    features are only needed in special cases.  We did this with our
    Sorting Algorithm project.

So, overall a hardware designer experienced in RTL simulation and synthesis
should expect to spend 6 to 7 weeks getting ramped up on AutoESL.  Much of
this depends on how deeply they want to learn its advanced features:

  - Controls.  Our users control results in several ways, including adding
    synthesis directives to control pipelining, interfaces, and memory
    using Tcl commands or pragmas or the GUI.

  - GUI.  AutoPilot has a GUI for users to understand the generated logic.
    For example, it has a schedule viewer to visualize the scheduling
    result and a report view so you can easily compare QoRs for different
    implementations.

  - Floating point synthesis.  We used single-precision float type and
    floating point adders for RankBoost; AutoPilot fully supports the
    standard single- and double-precision floating point data types for
    Altera platforms.  We could directly synthesize common floating-point
    math routines such as square root, exponentiation, logarithm, etc.

  - Loop and hierarchical function pipelining.  AutoESL's loop pipelining
    allows multiple successive iterations of a loop to execute in parallel
    by initiating one iteration before the previous one has completed.
    This can optimize the design for both loop throughput and latency.

  - Power reduction.  AutoPilot's optimization also includes various
    transformations for power reduction, including Operation Gating, MUX
    optimization and reduction, FSM coding, pipeline register gating, clock
    gating as well as using given Multi-Vdd assignment.  We don't pay much
    attention to power consumption with our current FPGAs so we didn't use
    this power functionality, but it's an important feature and we would
    like to try it out in the near future.

  - Interface Synthesis.  Our designers use AutoPilot's standard function
    parameters to infer the desired inputs and outputs to the environment
    rather than hand code any target-specific interface timing behaviors
    into our C/C++ source.  AutoPilot's interface synthesis converts the
    parameter reads and writes into the actual interface accesses.  The
    direction of the data transfer is inferred from the way a parameter is
    used in the function body.

    For example, based on the specified communication interfaces in the
    platform library, a store operation on a scalar pointer (e.g., *p = x)
    can be turned into a direct wire connection, or a FIFO write, or even
    a bus write transfer.  This helps tremendously to keep our designers
    away from the "devil-is-in-the-details" of the target platform and
    focus more on developing the functional/algorithmic part of the design.

    Currently, AutoPilot supports the following types of interfaces:

          - Wire interface,
          - Buffer interface,
          - Memory interface,
          - FIFO interface,
          - Bus interface

    The user can control the selection of interface with a few pragmas.
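
To put rough numbers on the loop pipelining bullet above, the usual
first-order model says a pipelined loop finishes in about
depth + (trips - 1) * II cycles, where II is the initiation interval.  The
sketch below applies that model to the histogram example from earlier in this
post, assuming an 8-cycle adder (the upper end of the 7 to 8 cycles quoted)
and ignoring memory access and loop-control cycles.  The function names and
the resulting cycle counts are illustrative assumptions, not AutoPilot report
data.

```c
#include <assert.h>

/* First-order cycle count for a pipelined loop: the first result appears
 * after `depth` cycles, and each further iteration starts `ii` cycles
 * after the previous one. */
long pipelined_cycles(long trips, long depth, long ii)
{
    assert(trips >= 0 && depth >= 1 && ii >= 1);
    if (trips == 0) return 0;
    return depth + (trips - 1) * ii;
}

enum {
    ADDER_DEPTH = 8,    /* assumed FP adder latency in cycles */
    UPDATES     = 255,  /* additions per 256-bin histogram */
    GROUP       = 16    /* histograms interleaved together */
};

/* One histogram at a time: each addition needs the previous sum, so a new
 * iteration can only start every ADDER_DEPTH cycles (II = ADDER_DEPTH). */
long cycles_one_at_a_time(void)
{
    return GROUP * pipelined_cycles(UPDATES, ADDER_DEPTH, ADDER_DEPTH);
}

/* GROUP histograms interleaved: dependent additions are GROUP iterations
 * apart, which covers the adder latency, so II = 1 becomes feasible. */
long cycles_interleaved(void)
{
    return pipelined_cycles((long)UPDATES * GROUP, ADDER_DEPTH, 1);
}
```

Under these assumptions, 16 histograms take 32640 cycles one at a time
versus 4087 cycles interleaved, roughly an 8x gain per pipeline before any
8-way unrolling is applied.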


AutoPilot's negatives:

  - It needs a better user interface for its CDFG.

  - It needs better 3rd party tool chain support.  It took us a while to
    set up the whole tool chain including our ModelSim RTL simulator and
    Altera Quartus II FPGA implementation tools.

  - AutoESL claims that AutoPilot supports Altera's SOPC builder tool
    and Avalon bus interconnects.  However, we did not test these.

  - It needs better Verilog support.  AutoPilot includes some libraries
    written only in VHDL; for example, a few platform-specific bus
    interface adaptors are generated only in VHDL.  It would be better
    if Verilog versions were generated as well.

AutoESL's technical support was professional and they covered the product,
integration into the design flow, and language.  We gave AutoPilot a first
look in 2007 and it's been delivering major features for FPGA-based design
in its recent releases.  This tool can produce very acceptable results in
a very short time.

I give AutoPilot a score of 4 out of 5 possible and would strongly recommend
it to others.  

    - Ningyi Xu
      Microsoft Research Asia                    Beijing, China