( ESNUG 486 Item 1 ) -------------------------------------------- [10/26/10]
Subject: (ESNUG 484 #9) Migrating from FPGA emulation to Cadence Palladium
> I am looking to hear from any hands-on users of Cadence Palladium who
> can share their experiences moving from a custom-built in-house FPGA
> emulator onto the Palladium emulator.
>
> Considering Palladium, we'd like to check out any unexpected surprises.
>
> - Gnanam Elumalai
> Adaptec, Inc. Foothill Ranch, CA
From: Srikanth Muroor <smuroor=user domain=telegent not mom>
Hi John,
We used a combination of a multi-FPGA (HAPS from Hardi, now Synopsys) and
Palladium emulation for our last project. For other projects in the past,
we had used the HAPS platform exclusively. The short story is that Palladium
nicely bridges the gap between slow-but-flexible SW simulators (~10 KHz) and
the fast-but-inflexible FPGA platform (~100 MHz) by providing speeds of up
to 2 MHz plus the visibility and turn times of a software simulator. The
Palladium system enhances productivity during debug when the hardware and
firmware are being concurrently developed.
FPGAs allowed us many "virtual" tapeouts that enabled co-verification of
hardware, firmware and host-side software before the "real" tapeout. As a
result, the bring-up time for our chips was significantly reduced. The FPGA
system also enabled validation of signal processing and video processing
at real time (or close to real time) speeds, i.e. billion-cycle validation.
FPGAs take a significant amount of work to get working. When something goes
wrong in the FPGA prototype, it is hard to zero in on the root cause of the
problem because of poor visibility into the design mapped to the FPGA. In
the early stages of development, when a bug is discovered, there is often a
phase of "blaming" either the firmware or the hardware for the failure, and
it can take weeks before the root cause is discovered. FPGA synthesis plus
place and route took a day for our design, which made it impractical to
experiment with speculative fixes during debug. Sometimes we had to deal
with artificial problems caused by overflow of the FPGA capacity, congestion
in the FPGA, timing errors and electrical interfaces on the FPGA that took
our focus away from verification of our design.
In contrast, it takes a few minutes to rebuild our design on the Palladium.
Palladium provides visibility into the design: registers, nets, wires and
memories can be observed and changed. Its emulation speed of about 2 MHz
is not as high as the FPGA's, but is significantly faster than a SW
simulator. Also, we no longer have to worry about partitioning across
multiple FPGAs, nor congestion, nor timing and electrical issues.
Basic FPGA vs. Palladium data:
(a) FPGA vs. Palladium MHz. The FPGA can run at 100 MHz after we have
worked through partitioning, synthesis and place and route issues.
We care about these speeds if we have to run realtime. In non-real
time modes, we run the FPGA at whatever speed the place and route
tools allow us -- typically tens of MHz.
Palladium II runs at 2 MHz.
(b) FPGA vs. Palladium turn times. After we have worked out timing and
capacity issues with our multi-FPGA platform, the total turn time is
dominated by the P&R time of the worst partition -- which takes about
6 hrs. Because of the higher clock speed, the run time is much
quicker. But if we find a bug and have to respin, it takes another 6 hrs.
Building the design on the Palladium from scratch takes 30 minutes,
and in the incremental mode it takes less than 10 minutes.
(c) Visibility. For the FPGAs, it is limited to the external I/Os - we have
to plan ahead and make several signals of interest available on the
I/Os. Because we have an embedded processor in the design, we gain
some additional visibility into the device registers and memory.
Whenever there's a hard problem that needs internal visibility, we
use Xilinx's Chipscope which functions like an internal logic
analyzer. However, to use Chipscope, we need to plan ahead and
identify signals of interest before launching the FPGA build. The
visibility offered by Chipscope varies based on how much internal RAM
is available (i.e. not in use by the DUT). Typical tracedepth is
8 K samples for the selected signals.
On the Palladium we use something called the FVDT mode (full vision,
deep trace), which gives us visibility into all registers, nets and
memories in much the same way a logic analyzer does.
The tracedepth is again limited by how much memory we have on the
Palladium boards. Typically we get about 150 K sample depth for
the entire design.
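As a rough sanity check on the speed figures above, the raw clock rates can be translated into wall-clock time for a fixed cycle count. A minimal Python sketch, using the 100 MHz / 2 MHz / 10 KHz numbers quoted in this post (the billion-cycle run length comes from the "billion cycle validation" point earlier):

```python
def wall_clock_seconds(cycles, freq_hz):
    """Wall-clock time needed to execute a given number of DUT cycles."""
    return cycles / freq_hz

# A billion-cycle validation run at each platform's typical speed:
CYCLES = 1_000_000_000
fpga_s = wall_clock_seconds(CYCLES, 100e6)   # FPGA at 100 MHz
pall_s = wall_clock_seconds(CYCLES, 2e6)     # Palladium II at 2 MHz
sim_s  = wall_clock_seconds(CYCLES, 10e3)    # SW simulator at ~10 KHz

print(f"FPGA:      {fpga_s:8.0f} s")             # 10 s
print(f"Palladium: {pall_s:8.0f} s (~8 min)")    # 500 s
print(f"Simulator: {sim_s:8.0f} s (~28 hours)")  # 100000 s
```

So a billion-cycle run is seconds on the FPGA, minutes on Palladium, and over a day in software, which is why all three platforms end up coexisting.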
Other "gotchas" if someone is considering the switch:
(a) The biggest gotcha is the learning curve switching from FPGAs to
Palladium. Familiarity with Incisive (or NC-sim as most people
know it) makes the process smooth. Cadence provides hdlICE,
which does not require a license, and helpful AEs to prepare the
design for Palladium before the system is delivered. Our design
(written from scratch with little verification at the time) was
up and running in about two weeks.
(b) You will have to trade off the execution speed of the FPGA against
the debug visibility and turn times of the Palladium.
(c) Replication - generally FPGA platforms are less expensive to own
and you could provide one platform to every group responsible
for a subsystem in the chip. In contrast, there is a lot of
contention for the single Palladium we own.
(d) Lab space and 220 V supply.
Like real life, there were pluses and minuses to each approach; so we
used both.
- Srikanth Muroor
Telegent Systems, Inc. Sunnyvale, CA
---- ---- ---- ---- ---- ---- ----
From: Alex Hubris <ahubris=user domain=sandforce got calm>
Hi, John,
I have used Palladium, and it is a tool that can be worth its weight in gold
if you use it appropriately. The tool works best as an In-Circuit Emulator
(ICE). I used it to prototype a video processing chip and was able to test
out different architecture configurations for the product because of the
way I modularized the synthesis and netlist creation. The prototype I was
developing was able to communicate over PCI-E SpeedBridge and a real serial
console to debug and communicate to the Palladium box.
The final design ran on Palladium at about 750 kHz talking to 2 host PCs;
one through a serial console with a Linux box and the other was a Windows
PC communicating over PCI-E SpeedBridge.
The chip was about 50 M gate equivalent -- I can't remember % in memory and
% in logic. I was only able to fit 1/3 of our chip onto our one Palladium
box of 16 M gates. We could have purchased more capacity to fit up to 256 M
gates in the Palladium system, but since our design was modular/scalable,
we were able to reconfigure the chip for one Palladium box -- thus saving us
a bunch of money. (Sorry Cadence!)
Our firmware guys were able to use the box to test out real firmware and
even boot Windows drivers on the Palladium box so it could communicate over
PCI-E.
PALLADIUM SETUP GOTCHAS
So, what's the catch? Well, it took about 2 months to set up the synthesis
and flow for modular compilation because of all the scripting that needed
to be developed. However, the Palladium tools are TCL based and easy to
learn. The Cadence AE's I worked with were quite helpful. Once developed,
it's pretty much push button.
The PCI-E SpeedBridge needed to be reconfigured and instantiated to talk to
our PCS block, but again the nice thing about working with the SpeedBridge
was that the Cadence AE field guys knew it and helped with the setup. The
SpeedBridge is not difficult to use if you read the manual first.
PROs
- Modular synthesis/compilation for fast turnaround. Palladium
tools allow you to parallelize the database creation of the netlist so
you can precompile different parts of your design to create the final
netlist database. In our case, each block of the design was like a
LEGO piece, which we used to create new netlist builds for testing out
different system configurations. We kept different configurations of
our system that could be downloaded to the Palladium box.
- Multiple concurrent users can test out different parts of the chip. The
Palladium box allowed different images to be downloaded concurrently by
multiple users. We downloaded different subsystems on the Palladium box
and had 2-4 users developing and testing different subsystems: CPU,
Console, PCI-E, etc ...
- Good visibility for HW debug when HW problems are encountered, if your
image is compiled with "full visibility". Their full visibility is nice in
that you don't have to recompile new signal taps or trace points to get
visibility into what is happening in your code -- meaning you don't need
to resimulate hours of emulation if you happen not to have the correct
signals tapped, as in the FPGA case. The downside is that a "full
visibility" compile (and runtime) takes much longer and creates large
data files.
- TCL scripting let us automate synthesis/compilation.
- Most Verilog/VHDL is already directly portable. Some portions of code
may need to be rewritten, but it is just as easy to try and compile your
existing design as is. We were able to completely port our design to
Palladium within 2 weeks with help from Cadence AE. Scripting to
automate the process took 2 months.
- PCI-E SpeedBridge automatically adjusts for emulation speed and keeps
the host alive without losing connection to emulated device.
- No partitioning to worry about across multiple FPGA's. If your
design does not fit into one FPGA, you will be better off initially
debugging with Palladium if you're doing modular designs. We had our
design netlist turned around within 30 minutes because of the modularized
way we created the compilation database. Synthesis takes care of the
RTL-to-netlist translation and database creation. As long as your chip
fits in a Palladium box, you don't have to deal with partitioning
across multiple FPGA's.
- You can develop real firmware on the emulated system. Our firmware guys
were able to develop/test firmware even running at the 750 kHz frequency.
We could boot up our internally developed OS and run video transcoding.
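The "LEGO piece" build flow in the first PRO can be sketched abstractly. This is a hypothetical Python illustration of the idea, not real Palladium tooling: each block has its own precompiled database, and a respin only recompiles the blocks whose RTL actually changed, which is what keeps the netlist turnaround down to minutes.

```python
import hashlib

def digest(src: str) -> str:
    """Fingerprint of one block's RTL source."""
    return hashlib.sha256(src.encode()).hexdigest()

def incremental_build(blocks: dict, cache: dict) -> list:
    """Return the blocks that actually need recompiling, and record them.

    `blocks` maps block name -> RTL source; `cache` holds the fingerprint
    of each block's last-compiled source."""
    stale = [name for name, src in blocks.items()
             if cache.get(name) != digest(src)]
    for name in stale:
        cache[name] = digest(blocks[name])   # "compile" and record
    return stale

blocks = {"cpu": "module cpu; endmodule",
          "pcie": "module pcie; endmodule",
          "console": "module console; endmodule"}
cache = {}
print(incremental_build(blocks, cache))  # first build: all three blocks
blocks["pcie"] = "module pcie; /* fix */ endmodule"
print(incremental_build(blocks, cache))  # respin: only ['pcie']
```

The block names here (cpu, pcie, console) are just taken from the subsystems mentioned in this post; the real flow would drive the vendor's compile step instead of updating a hash cache.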
CONS
- Ramp-up to learn the new tool and to incorporate it into our existing flow
took a long time. You need the buy-in of management, since a different
flow and methodology will need to be developed for using the tool.
I was lucky: I was able to show returns shortly after getting the
design on Palladium, finding a system bug right away as part of our eval
process before we even purchased the Palladium system.
- You have to create new models and test benches that allow the Palladium
box to be used as an In-Circuit Emulator. For our usage model, I needed
to create synthesizable test benches in order to interface to my design.
Since I was starting from scratch, I took this path, knowing it
yields the highest speed on the tool.
- Speeds are in the 1 MHz range. We initially had our system running at
1.3 MHz, but as we added more design pieces, the speed came down to 750 kHz
for nearly 10 M gates of logic plus the equivalent in memories. As we added
more logic, the logic steps required to implement the system increased
and therefore lowered our operational frequency. Even at the
750 kHz speed we were able to interface to a serial port on the Palladium
box as the console interface for our design, running at 4800 baud.
The PCI-E SpeedBridge automatically adjusted for the slower emulation
speed and kept the host connection to the PC running Windows so the
emulated device was always visible.
- Palladium is not a free tool... Our Palladium box had only a 16 M gate
capacity when our design was 50 M gates. We could have used more
Palladium boxes given our chip size, but our modular/scalable design
approach saved us the need to buy them. Not everyone can do this.
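To get a feel for the speed margin in the serial-console point above: at a 750 kHz emulated clock and 4800 baud, the design sees roughly 156 emulated cycles per UART bit time. This is my own back-of-the-envelope check, not a figure from the posting:

```python
def cycles_per_uart_bit(emul_clk_hz: float, baud: int) -> float:
    """Emulated clock cycles available during one serial bit time."""
    return emul_clk_hz / baud

print(cycles_per_uart_bit(750_000, 4800))    # 156.25 cycles per bit
print(cycles_per_uart_bit(1_300_000, 4800))  # ~270 cycles at the initial 1.3 MHz
```

Plenty of cycles per bit for a typical UART oversampling receiver, which is presumably why the console still worked as the emulation speed dropped.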
In our experience Cadence's SpeedBridges and ICE synthesis/compilation tools
were mature for Palladium. If you are open to the ICE approach, Palladium
may be right for you.
- Alex Hubris
SandForce, Inc. Saratoga, CA
---- ---- ---- ---- ---- ---- ----
From: [ A Little Bird ]
Hi, John,
Please keep me anonymous if you do choose to share this.
I don't have any experience with Palladium. However, we did argue over it
here at my company once. The one thing nobody could argue against with the
home-grown emulator is its speed.
It is very easy to say Palladium helps debug much faster with internal nets
visible. However, with 1 MHz (or below) for Palladium versus 50 MHz (or
above) for home-grown emulator, I can reproduce the failure 50x faster. For
the type of failure we saw here, 50x faster means 1 failure overnight versus
1 failure per 25 days. Palladium may help debug faster, but it takes 24
extra days to reproduce one failure. That is over 3 weeks per bug. The
efficiency of Palladium in finding bugs just goes down.
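The arithmetic above checks out if "overnight" means roughly a 12-hour run; that run length is my assumption, since the posting doesn't state it:

```python
def repro_hours(fast_run_hours: float, fast_hz: float, slow_hz: float) -> float:
    """Time for the slower platform to cover the same failing cycle count."""
    return fast_run_hours * (fast_hz / slow_hz)

overnight = 12.0                           # assumed length of one overnight run
slow = repro_hours(overnight, 50e6, 1e6)   # 50 MHz home-grown vs 1 MHz Palladium
print(f"{slow:.0f} h = {slow / 24:.0f} days")  # 600 h = 25 days
```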
I think Cadence will call me soon to prove me wrong. :-)
- [ A Little Bird ]
---- ---- ---- ---- ---- ---- ----
From: Russell Vreeland <vreelandr=user domain=gmail not calm>
Hi John,
I worked at Broadcom for 10 years where we used both in-house FPGA and
Mentor Veloce -- not Cadence Palladium. So my comments are on the
general problem of an in-house FPGA vs. commercial emulator.
The only big technical reason to prefer an in-house custom FPGA prototype
fixture (a better description than "emulator") is sheer execution speed. A
prototyping solution can run at FPGA speeds of 20 to 50 MHz or above,
depending on how much one tries to squeeze into the FPGAs. The commercial
emulator will run near, at, or slightly above 1 MHz depending on the
design. Problems
can arise in moving from FPGA to emulation due to this speed difference.
For example, if external HW is interfaced to the prototype it must be
capable of running at the slower emulation speed. Most commercial emulation
vendors supply speed-adapted HW for most standard interfaces.
However, speed is where the advantages of an FPGA prototype fixture end.
You have a process of importing the design into the FPGA prototype that is
not automated; you must do manual partitioning, routing, etc. and each step
of this process adds the possibility of error and can be time consuming.
By contrast, Veloce does all these steps with software supplied with the
emulator. This compile software is very mature, having been refined over
many years. Late in the project when the RTL is stable, the time to map to
an FPGA prototype can be tolerated, but not early in the project when there
is constant RTL revision. RTL running on the FPGA prototype is typically
many revisions behind the current revision, which provides little value.
With FPGAs there is no real debug capability. Once one experiences a
difficult-to-find bug deep in a DUT, there simply is no comparison between
the two approaches. I have found functional bugs in chips after more than
2 full seconds of emulation time in Veloce that could be traced, waveforms
captured, a fix implemented, and re-verified in one work day. Detecting
a functional problem is key, but the ability to effectively isolate and
correct the error is equally important.
Another great feature of an emulator is the ability to reuse a simulation
testbench. This is accomplished through comodeling: attaching an emulator
directly to a transaction-based testbench on a host workstation, a capability
Veloce and its predecessors have had for 10 years. While FPGA prototypes are
dedicated to a single design, emulators in comodeling mode are a general
resource that can be shared across multiple project teams and time zones.
Time-sharing a multi-user emulator helps close the price gap with FPGA
prototypes. While each FPGA prototype may be less expensive to purchase,
you need many more to support the same number of users.
The difference in runtime speed with respect to interfacing external HW is
the main issue to consider when moving from FPGA prototype to commercial
emulation. A nontechnical issue is cost. Once those issues are dealt with
there are a host of benefits to using emulation over FPGAs.
- Russ Vreeland, Consultant
---- ---- ---- ---- ---- ---- ----
From: Kasturi Rangam <krangam=user domain=nethra not us not calm>
Hi, John,
At my company, we had been using FPGAs for prototyping our designs.
We switched to Palladium because of the following issues:
1. Our existing FPGA did not have enough capacity for our new designs.
2. Off the shelf FPGA boards did not have enough connectivity.
3. Building a custom FPGA board would take some time.
Our experiences with Palladium after the switch:
1. Ramp-up time was very short - we got the box just before tape out
and still got it working and simulated our use cases before tape out.
2. We took our ASIC database directly to Palladium with support for
ASIC I/O pad models and memory models. PLLs are not supported,
unless you have a synthesizable PLL model.
3. We needed to create synthesizable testbench and checkers to get
maximum performance.
4. Compile time for Palladium is small compared to FPGA compiles.
5. We did create a custom FPGA platform, which was used after the design
was debugged for software development and interfacing with the system.
6. Palladium was a complement to the custom FPGA platform for us.
7. Due to design partitioning and the size of the design, we were only
able to get 5 MHz on our custom FPGA platform, compared to about
1 MHz on Palladium.
8. Virtual frequency simulations on Palladium were the same as ASIC
frequency, so we could do bandwidth analysis. Custom FPGA required
us to play with frequency ratios to achieve similar use cases.
9. We could do gate-level simulations on Palladium using gate-level
netlists as libraries. The speed was the same as RTL. We cannot do
SDF simulations, though.
10. I believe there is a way to map FPGA bitmap onto Palladium if you
want to take that route - I am not 100% sure.
11. There is a way to connect to system components using SpeedBridge.
We haven't done this yet.
12. You can dump waveforms. The simulation performance is slower when
you do that. They support the fsdb format only, so you need to have a
waveform viewer that supports fsdb or convert to VCD.
13. We did not need an engineer dedicated to do the Palladium compiling.
Anyone could do that.
The only surprise we got was the large size of the Palladium box compared to
FPGA platform. You need to plan for a good air conditioning unit in your
lab when using this box.
- Kasturi Rangam
Nethra Imaging, Inc. Santa Clara, CA
---- ---- ---- ---- ---- ---- ----
From: Chao-Lin Chiang <cchiang=user domain=plxtech got calm>
Hi, John,
I'd like to provide feedback based on our experience at PLX Technology:
1. Time to bring up the platform is considerably shortened. We don't
need to build an FPGA board to emulate the DUT. Also, the SpeedBridges
save a lot of time compared with our in-house developed RateAdaptor. We
don't have to spend an inordinate amount of time partitioning the design
among various FPGAs and waste cycles debugging the FPGA database.
2. Quick turn-around time. Palladium has a very decent front-end
synthesizer/compiler. It takes less than 30 minutes for 10-15 M gates
to complete the cycle:
Fix-RTL -> synthesize/compile -> rerun -> verify the fix
3. Easy to debug. We can view all the signals in the original design
with no extra effort to pre-specify a "watch-list".
4. We are finding a lot more bugs early in the design cycle as opposed
to much later. We are finding 3-4X the number of bugs with 1/2
the people we had dedicated to the FPGA platforms.
5. Software development cycle starts much earlier given that engineers
are able to work on more unstable/early databases.
Though the upfront cost is high for Palladium, in the long run it pays for
itself compared to the FPGA platforms that tend to have more maintenance
and a continuous support/cost/reliability burden.
- Chao-Lin Chiang
PLX Technology, Inc. Sunnyvale, CA