Friday, April 24, 2009

High performance ARM cores and CAD Reference methodology

It is funny what kind of deliverables 1 or 2 CAD vendors (among the top four) end up providing to customers as a reference methodology (RM) for implementing a high performance ARM core.

1. Sometimes the RM is incomplete.
2. The vendor never managed to get even within 50-100 MHz of the performance quoted by ARM (including OCV and SI).
3. The vendor used unreasonable OCV and SI targets to get to the desired frequency.
4. The methodology has formal verification issues.
5. There is no mention of a hold margin.
6. No signoff criteria have ever been met with the methodology at all.

In fact, it is hilarious to see how they could ever possibly have achieved what ARM claims is achievable on the core. I am referring to a quick turnaround time (TAT) ASIC EDA flow using standard TSMC libraries to achieve the frequencies quoted by ARM.

An ARM core is IP, and whatever methodology the CAD vendor provides should achieve the frequency ARM quotes in the data sheet, or at least come within 50-70 MHz of it. Extra effort and customization might be required to push beyond the frequency ARM quotes for an ASIC methodology, which is understandable and reasonable.

Otherwise the methodology defeats the purpose of the core and wastes the money and time a company has to invest in implementing the IP. Such methodologies can significantly harm an ASIC design schedule.

Engineers end up spending a significant amount of time and resources trying to maximize the frequency and minimize the power and area of the core. It is our job as designers to produce a good implementation of the core. But the exploration space in the CAD tools using the RM should not restrict us to points in the 0-500 MHz Pareto space.

I definitely feel that this is not the case with all CAD vendors, and 1-2 of them have put genuine effort into coming up with a good methodology. The ones who ping-pong too much between AE and R&D are the ones I would trust the least! ARM has proven IP, for God's sake, so what is all the fuss that the CAD vendor is creating?

Do tell us what your positive/negative experiences have been if you have worked on an ARM core, and whether a CAD methodology/vendor has really helped boost the performance of your core in the past.

It is time that ARM started promoting 1-2 vendors over others if in fact they feel that TAT can be reduced as a result of changing the vendor in order to achieve the desired performance on the core.

Although this does not satisfy the criteria of an ideal IP, who cares as long as people get what they want!

Friday, February 29, 2008

It's Raining Buffers!!

Shrinking feature sizes->increasing die sizes->faster clocks->interconnect density->reducing supply voltages->Overbuffering? (courtesy: Henny Penny).

Well.. I will trust Henny Penny and be his humble devotee. Will I end up with higher area and more power consuming silicon? The CAD guys will tell me that my buffer counts are realistic. Buffering under the Elmore delay model can be off by as much as 200% in comparison to SPICE. Buffer/inverter counts can no longer be evaluated as a percentage of total gate count.

What are the metrics for buffer estimation which we can rely on in the 45 and 65nm realm?

While there's more to follow in this article.. industry veterans out there, let us know your thoughts and the metrics you have used and are currently using to prove the CAD guys wrong.

Monday, February 25, 2008

The future of multicore.

Here is an interesting debate which happened at ISSCC over the future of multicore processors.

NVidia's recent acquisition of AGEIA points to a possibility where engineers could integrate a physics processing engine with a GPU engine to get some cool real time graphics computations. AMD has acquired ATI. They could have plans for integrating a CPU and a GPU onto a single chip.

There are plenty of low power media processing engines that have done cool stuff like integrating ARM cores with DSP functionality to get the best of both worlds (low power + performance in processing MPEG and H.264 data).

Here's an interesting link from EE Times:

http://www.eetimes.com/showArticle.jhtml;jsessionid=4EKF1BULG3GFMQSNDLOSKHSCJUNN2JVN;?articleID=206105179

Sunday, February 3, 2008

Correlation issues in a CAD flow

There can be numerous issues relating to correlation across various steps in a typical ASIC CAD flow. These issues can cause timing/area/power convergence problems in meeting ASIC tapeouts.

1. Logic Synthesis:
-------------------
At the logic synthesis step not much is known about the wires in the design. Wires can only be estimated, using statistical models based on design size and design type; they cannot be computed. The delay calculation engine makes use of either a pessimistic or an optimistic wire model. Only post placement can we back-annotate a reasonably accurate SDF/SPEF and redo the logic synthesis step for more restructuring to model reality. The floorplan, which plays a significant part in timing closure, is not known at the logic synthesis stage. High fanout buffering also has to be estimated and cannot be computed, and the delay models used (like logical effort, a linear delay model) are meant for speed and not for accuracy.
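A statistical wire model of this kind can be sketched as a fanout-indexed lookup table. This is only an illustration: the table values, the per-micron resistance and capacitance, and the function name are made-up assumptions, not numbers from any real library or tool.

```python
# Hypothetical fanout-based wire load model. All numbers are illustrative;
# real tables come from the library vendor or from statistics on past designs.
WIRE_LENGTH_BY_FANOUT = {1: 10.0, 2: 22.0, 3: 35.0, 4: 50.0}  # estimated length (um)
SLOPE_UM_PER_FANOUT = 15.0  # extrapolation slope past the last table entry (assumed)
CAP_PER_UM = 0.0002         # pF per um (assumed)
RES_PER_UM = 0.0005         # kOhm per um (assumed)


def estimate_wire(fanout):
    """Return (length_um, cap_pf, res_kohm) estimated purely from fanout."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    max_fanout = max(WIRE_LENGTH_BY_FANOUT)
    if fanout <= max_fanout:
        length = WIRE_LENGTH_BY_FANOUT[fanout]
    else:
        # Linear extrapolation beyond the table.
        extra = fanout - max_fanout
        length = WIRE_LENGTH_BY_FANOUT[max_fanout] + SLOPE_UM_PER_FANOUT * extra
    return length, length * CAP_PER_UM, length * RES_PER_UM
```

Note that nothing in the estimate depends on where the cells actually sit, which is exactly why it can be badly off for any individual net.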

NLDM and SPDM models are really not used inside a logic synthesis system, as they are complex polynomial functions that are too time consuming to minimize over multiple variables on complex multi-million gate designs. There is also no real value in using such accurate models (SPDM) for delay calculation, as the design is not really complete by any means.

There is no clock tree at this stage, so insertion delay and skew might have to be modelled based on design experience, or these values could be derived after a quick prototyping flow.

There are synthesis tools now on the market which try to read in floorplanning information as a part of the synthesis process and quickly estimate block placements. But remember that this could involve a tough loop for RTL/logic designers with already tight RTL delivery schedules. Also, these tools might not be accounting for correct timing/congestion aware placement information.

2. Post placement/Global route:
-------------------------------
Elmore delay is a single moment delay computation technique. The model does not account for proper slew computation/propagation. It also does not account for resistive shielding (i.e. the gate does not see an effective capacitance but sees the total capacitance as its load). This makes the model pessimistic. This is good news for global to final mode closure, as all this pessimism is inherently added to the design flow and over-constrains the physical optimization algorithms.
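For reference, here is a minimal sketch of the Elmore computation on a simple RC ladder (driver at one end, one sink at the far end). The function name and the ladder representation are my own choices for illustration. Each series resistance is multiplied by all the capacitance downstream of it, and the result is a single number with no slew information.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay at the far end of an RC ladder.

    resistances[i] is the series resistance of segment i and
    capacitances[i] is the capacitance to ground at the node after it.
    Units are the caller's choice (ohms and farads give seconds).
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0.0
    downstream_cap = sum(capacitances)
    for r, c in zip(resistances, capacitances):
        # Segment i has to charge every capacitor from node i onwards.
        delay += r * downstream_cap
        downstream_cap -= c
    return delay
```

For a two segment ladder with R = [1, 1] and C = [1, 1] the first resistor sees both capacitors and the second sees one, so the delay is 1*2 + 1*1 = 3.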

Vias are not inserted at this stage of the design flow. Rather, they can be modelled. The modelling technique used can affect the global mode via resistance calculation. Estimating crosstalk at this stage can be very tricky, as track assignment is not yet complete and coupling capacitances can only be predicted and not computed. This could be achieved by some means of virtual track assignment. But all this is going to add run time to GR updates, incremental GR updates and the flow.

How many tools do you know of that account for crosstalk pessimism at the global mode stage?

A reasonable thing to do in STA here seems to be propagating the worst arrival time (AT) + worst slew forward in the timing engine so that some more pessimism is built into the physical synthesis process.
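A rough sketch of what "worst AT + worst slew" propagation means is below. The gate model (delay = intrinsic delay + slew_gain * input slew, slew passed through unchanged) is a toy assumption of mine and not how any real timer computes delay. The point is only that the worst arrival time and the worst slew are picked independently, even when they come from different inputs.

```python
def propagate_worst(order, fanin, intrinsic_delay, start, slew_gain=0.5):
    """Propagate worst arrival time and worst slew through a timing graph.

    order: node names in topological order.
    fanin: node -> list of driver nodes (missing or empty for start points).
    intrinsic_delay: node -> intrinsic gate delay (toy model, assumed).
    start: start point -> (arrival_time, slew).
    """
    at, slew = {}, {}
    for node in order:
        drivers = fanin.get(node, [])
        if not drivers:
            at[node], slew[node] = start[node]
            continue
        # Pessimistic merge: the two maxima may come from different drivers.
        worst_at = max(at[d] for d in drivers)
        worst_slew = max(slew[d] for d in drivers)
        at[node] = worst_at + intrinsic_delay[node] + slew_gain * worst_slew
        slew[node] = worst_slew  # toy assumption: slew is not regenerated
    return at, slew


# Input "a" arrives early with a slow edge, input "b" arrives late with a fast edge.
at, slew = propagate_worst(
    order=["a", "b", "g"],
    fanin={"g": ["a", "b"]},
    intrinsic_delay={"g": 0.5},
    start={"a": (0.0, 0.4), "b": (1.0, 0.1)},
)
```

In this example the latest arrival (from "b") gets combined with the slowest slew (from "a"), so the arrival time at "g" comes out later than either input would give on its own. That difference is the extra pessimism being built in.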

Another valuable thing to do at this stage is to monitor the design congestion. Maybe we could combine the congestion metric with the x-talk metric and derive a statistical model for x-talk at the GR stage of the design flow to account for extra margins during the physical synthesis step?

3. Post final route:
--------------------
Yeah, I finished 1 and 2 and go to 3. Now I do whatever optimizations are permitted at the CTS (useful skew), track routing and OCV optimization (common path pessimism removal) stages. But I still have a lot of difficulty closing my design by a couple of hundred picoseconds.

The problem is that even small restructuring still can't be performed post layout (final routing), and the final route could have detoured from the original global route topology in fixing the LVS/DRC violations. The chance of this happening is much lower in less congested designs, as final routes like to follow the global routing topologies as much as possible. It is also lower if the CAD tool is trying to minimize the sum over all nets i of |global wire length(i) - final wire length(i)|.
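The wire length mismatch between global and final route is easy to monitor yourself if you can dump per-net lengths from both stages. A small sketch, assuming two plain dictionaries of net name to wire length (how you extract them from your tool is up to you, and the function names here are hypothetical):

```python
def total_detour(global_lengths, final_lengths):
    """Sum over all nets of |final wire length - global wire length|."""
    if set(global_lengths) != set(final_lengths):
        raise KeyError("both stages must report the same set of nets")
    return sum(abs(final_lengths[n] - global_lengths[n]) for n in global_lengths)


def worst_detours(global_lengths, final_lengths, top=3):
    """Nets with the largest mismatch, worst first, as (net, mismatch) pairs."""
    mismatch = {n: abs(final_lengths[n] - global_lengths[n]) for n in global_lengths}
    ranked = sorted(mismatch.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Nets near the top of the `worst_detours` list are the first place to look when post route timing does not correlate with global route timing.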

Whatever optimization happens post layout is an incremental/ECO kind of optimization which will affect steps like sizing, buffering, placement and routing to an incremental extent.

ECO optimization happens with most tools reverting back to an Elmore delay model internally to save on memory and run time. Here again the slew modelling could be off if this is the case. Sizing/buffering using an accurate delay model is absolutely essential at this stage, so tools make use of multipole models.

Most optimization commands use an internal call to the extractor and delay calculator. The extractor comes up with an unreduced RC network. The job of the delay calculator is to build a reduced RC network taking into account only the dominant poles (and zeros). This process is called MOR (model order reduction), and a reasonably accurate technique to do this is asymptotic waveform evaluation (AWE). AWE-style moment matching, when computed using Pade via Lanczos (PVL) or the Arnoldi method, is a reasonably accurate delay calculation technique.

Adding margins sounds like an elegant/good solution, but margins are design dependent, could end up being pessimistic and can have a detrimental effect on chip area and power.

There are so many pessimisms/optimisms involved in the entire ASIC design process. Let me know your thoughts on how you handled these CAD issues in your latest design projects, as this information could be very interesting and helpful.

Thursday, January 31, 2008

Design of a 50M gate ASIC..

These are 10% of the problems ASIC designers face in Physical Design and STA of a 50M gate count ASIC.

Some of the problems are:
-------------------------
1. Not many CAD tools exist out there which can handle this design flat, at least in an initial prototyping phase (synthesis + cluster placement), to arrive at the initial logical/physical hierarchies.

2. This leaves us with 50 partitions (assuming an ideal partition size to be 1 Million gates). 1 Million gates is an ideal block size which can be closed Netlist to GDS typically in < 1 day's time frame in current CAD tools. (timing closure + routing + clean LVS and DRC).

50 blocks is very very tough to manage though.

A better tradeoff.......

Maybe we can have 10 partitions, ideally with 5M gates each. But my block run times will be higher and I have to live with those block run times (netlist to GDS might easily take 3-4 days or more).

3. K-way partitioners work best when the number of partitions k is at most 5-6. Not sure how good an initial seed partition is going to be if we have 50 of them. So I will freeze on a max of 10, or worst case 20, partitions.

Partitioning has traditionally used pin count reduction as the main cost function. Any other cost function is EDA sales innovation.

4. Reality check == Partitioning is not timing/congestion/power driven.

5. There can't be much glue logic at the top level, and it is a sliced floorplan. This is the simplest floorplan possible (minimal glue logic at top).

6. People have designed these chips routinely. But they were a bunch of very smart chip designers sitting at IBM Microelectronics. (Not monkeys pushing CAD tools.)

7. This is a 45nm chip? 200-250 mm^2 die size. What is my yield gonna look like? How many times do I keep spinning this thing through the fab?

8. How am I going to reduce leakage on this chip? Power gating? Multi-Vt? V VFS? Some special-K dielectric which the fab is going to employ? All of the above?

9. The complexity is difficult to fathom if the chip has multiple modes of operation (even 2-3 modes is very, very hard). What if it has multiple power domains? (3).

10. How do I generate the block level constraints? My netlist is getting generated in a bottom up fashion. Although I would love to push top level constraints down to each of the blocks, I will have a schedule slip of 6 months if I wait for my top level to finish :(

11. How do I plan my block budgeting?

12. The constraints will be quite tough to manage for all modes/corners. How do I clean up the messed up constraints? in each mode/multiple modes? across corners?
Do I use automatic constraint generators? validators? How correct are they going to be? How much time do I waste trying to see if these tools are production worthy?

13. Some of my sub chips have 200+ macros. Are mixed placers (capable of simultaneous std cell + macro placement) going to help my woes, at least with a good initial seed placement?

14. How do I design my power plan? IR drop and EM limits, routing congestion, adequate de-cap insertion?

15. How the heck do I handle those monstrous ECOs? Reconfigurable filler cells (metal-only ECO)? What spare cell planning will help?

16. How do I do the final timing signoff? Incremental timer updates.. how long are they gonna take in STA tools? Do I use ILMs? Across corners and across multiple modes? Tough to manage so many timing models.

Validating the ILMs for correctness is another big challenge.

17. How many clocks does this monster have? Clock tree (CTS) is a nightmare with multiple modes and across corners.

Also, what if the clocks cross multiple blocks? I need to do proper clock planning early on to keep the clock from jogging too much across multiple modules running at different voltage levels (and modes).

18. I hope my manager doesn't commit to a 2-3 week final netlist turnaround time (netlist to GDS2) with the end customer.

Let me know your thoughts :)

Thursday, January 3, 2008

on routing (Routing-2)..

Within the physical design flow of a multi million gate ASIC, one of the most critical and notoriously difficult steps to perform is Routing.

To tackle this problem, it has been split up into sub problems such as

1. Global routing
2. Track routing
3. Detail routing.

The first, and one of the most critical, of these steps is global routing. The quality of this routing solution has a direct impact on

1. Chip frequency
2. Area
3. Power consumption
4. Iterations required to complete the design cycle.

Global routing takes us to a stage where signal nets are coarsely routed under a given placement solution so that wire/via spaces are allocated to each signal net.

While the objectives at this stage include routing multi terminal nets (nets with > 2 terminals), taking net criticality/slack into account and accounting for congestion, even the simplest version of the problem (i.e. routing a set of 2 terminal nets under congestion constraints) is NP complete.
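To make the congestion objective concrete, here is a minimal sketch that routes one 2 terminal net over a grid of global routing cells using Dijkstra's algorithm, charging a penalty for entering a cell that is already at capacity. The cost function, the penalty value and the grid representation are illustrative assumptions and not what any particular router does. It handles one net at a time; the hard part in practice is choosing routes for all the nets together under shared capacity.

```python
import heapq


def route_net(usage, capacity, src, dst, penalty=10.0):
    """Cheapest path for one 2 terminal net on a grid of routing cells.

    usage[r][c] is the number of tracks already used in cell (r, c).
    Entering a cell costs 1, plus `penalty` if the cell is at or over capacity.
    Returns the list of (row, col) cells from src to dst, or None if unreachable.
    """
    rows, cols = len(usage), len(usage[0])
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == dst:
            break
        if d > dist[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                step = 1.0 + (penalty if usage[nr][nc] >= capacity else 0.0)
                nd = d + step
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path
```

With a full cell sitting directly between the two terminals, the route takes a detour around it instead of the straight line, trading wire length for congestion.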

Advances in VLSI fabrication technology (<65 nm) have posed a new set of issues to be solved and have put further pressure on global routing technology (especially handling non default rules at the global routing stage).

Monday, December 3, 2007


courtesy:www.phdcomics.com :)