Better Choreography Required For Complex Chips

Just choosing the best tools and technologies is no longer enough. Chip design now requires extensive and extremely tight coordination.

The rapidly growing number of features and options in chip design is forcing engineering teams to ratchet up their planning around who does what, when it gets done, and how various components will interact.

In effect, more elements in the design flow need to be choreographed much more precisely. Some steps have to shift further left, while others need to be considered earlier in the planning process even if the work happens later in the flow. Otherwise, problems can crop up at any point in the design cycle, when they are more costly to fix, and schedule delays can leave some resources idle while others scramble to meet deadlines.

This planning involves people, tools, and technology, and it becomes more challenging as device complexity increases and as demands grow for increased reliability and more customization.

“When I started my career, it would take us three days to do the physical design part of the chip, and you’d be doing something else at the same time,” said Graham Curren, CEO of Sondrel. “Nowadays, it will take a team of 50 to 100 people 18 months to do the same job. The scale of the teams is hugely different, which is reflected in how you manage them. If you’ve got a team of one or a team of five, you can manage that team yourself. If you’ve got a team of 25, you suddenly create another level of hierarchy. If you’ve got a team of 125, you’ve got yet another level of hierarchy and management structure.”

Coupled with that is the complexity of the designs themselves. Curren recalls designing chips from gates. “Nowadays, it’s rare not to have six or eight microprocessors, a few GPUs, a neural network, a bunch of HDMIs and USBs, and all sorts of other things. Each one of those is very complicated in its own right.”

This complexity is inevitable and cannot be designed out. While there may be no way to reduce it, there are ways to manage its effects.

“You must constantly iterate,” said Michael Lafferty, Solutions Group director at Cadence. “The days of picking a tapeout candidate design and pushing it through each step and ‘locking’ down the result before moving to the next step are in the past. The complexities of the modern large-scale designs and the tools require an iterative approach for predictable improvement and therefore time to closure.”

That iterative approach needs to be factored in early in the design cycle, as well, when it’s easier to fix problems. “As we go from technology node to technology node, where feature sizes are getting smaller, the foundries have added more checks to make sure that designs are manufacturable,” said Michael White, product manager for Siemens EDA’s Calibre tool suite, who advocates shifting left as much as possible into the initial design phase. “From the 28nm node to the 3nm node, the number of checks needed is 7 times larger. In parallel with that, the types of analyses that need to be done also have changed dramatically. In prior technology nodes, there were three major types needed for a design. Now there’s a growing handful of them.”


Fig. 1: Cycle time can be reduced by shifting left. Source: Siemens EDA

Adding to the complication, new design choices are emerging as chips and chiplets are stacked, and those choices create new issues.

“Am I going to do my design as a traditional two-dimensional SoC, where everything’s on that one die?” White asked. “Or will I have elements of that packaged as separate dies in a 3D-IC? And if so, how do I divide things up? How do I want to stack them?”

Those questions lead to additional concerns about how to handle thermal dissipation and mechanical stresses in a stacked “sandwich” design.

“How you deal with heat and get it away from what’s in the middle of that sandwich becomes an important consideration,” White noted. “The stacks induce stresses on each other as you’re building your sandwich, which creates new things that need to be considered as you’re doing floor-planning, final sign-off, and verification so that it’s going to work as expected.”

EDA and multiphysics simulation tools model those stresses, but there’s still a risk of design errors. Design engineers have to remember to include all possible parameters into the tools, and tool vendors need to flag missing items and prompt further exploration.

“Depending on what information is missing, we will warn/error/fatal, indicating what the problem is and what to check for,” said Tom Taylor, training and development specialist at Ansys. “These types of analyses pull data from many sources, so getting all the collateral correct is critical. The way we also address this is by generating checking reports for the different input data which can flag missing or inconsistent data. We add to these checks with increasing sophistication as the input data issues get resolved. For example, for missing vias, we start with simple file issues and missing data, user fixes, then start looking at design issues such as missing connections, missing vias, and others prior to actually running the electrical analyses.”

The goal is to find the problems quickly, fix them, and then move to more sophisticated, more expensive analyses. Those analyses are done when the data is good, and used to find more subtle problems and interactions, said Taylor. “Why not go straight to the sophisticated analyses? While they will find all those design issues due to their sophistication, they consume a lot more time and compute resources. That is a waste for very simple issues which can be found much more cheaply.”
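The cheap-checks-first ordering Taylor describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the check names, the relative costs, and the dictionary-based design record are all hypothetical and do not represent any vendor's tool or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Check:
    name: str
    cost: int  # hypothetical relative cost; lower means cheaper to run
    run: Callable[[Dict], List[str]]  # returns a list of problem messages


def check_input_files(design: Dict) -> List[str]:
    # Cheapest stage: is the required collateral present at all?
    required = ["netlist", "layout", "tech_file"]
    return [f"missing input: {key}" for key in required if key not in design]


def check_vias(design: Dict) -> List[str]:
    # Mid-cost stage: simple design issues such as missing vias.
    return [f"missing via on net {net}" for net in design.get("nets_without_vias", [])]


def electrical_analysis(design: Dict) -> List[str]:
    # Stand-in for the expensive analysis; only reached once the data is clean.
    return [f"excessive voltage drop on net {net}" for net in design.get("hot_nets", [])]


def run_staged(design: Dict, checks: List[Check]) -> Tuple[Optional[str], List[str]]:
    """Run checks from cheapest to most expensive, stopping at the first stage that fails."""
    for check in sorted(checks, key=lambda c: c.cost):
        problems = check.run(design)
        if problems:
            return check.name, problems
    return None, []


CHECKS = [
    Check("electrical analysis", 1000, electrical_analysis),
    Check("input files", 1, check_input_files),
    Check("via connectivity", 10, check_vias),
]

# The missing tech file is reported before any expensive analysis is attempted.
stage, problems = run_staged({"netlist": "a.v", "layout": "a.gds", "hot_nets": ["vdd"]}, CHECKS)
```

Stopping at the first failing stage is one possible policy. A real flow might instead collect all cheap findings before deciding whether the expensive run is worthwhile.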

While this is an example of what’s possible, it’s also essential when selecting tools to understand what the tool’s true and full capabilities are. The engineering team must determine if the tool will catch or anticipate mistakes, and if so, how it prompts corrections.

Further, foundries coach customers to consider all possibilities, including different process and packaging options. For example, an analog/mixed signal company seeking to develop a new chip for an automotive braking system may be asked such questions as, ‘What is your market? What is the technology node you’re targeting? What are the key performance metrics that you’re trying to achieve?’ Instead of one big SoC, could it make more sense to decompose it into functional parts?

In response to the answers, the ASIC architect might be encouraged to design a chiplet. Such an approach would allow the digital logic to be at the most advanced node, in order to take advantage of the power and performance benefits, and the analog portion or less essential functions could be built at mature nodes. That provides cost savings and more room for refinements. Then, all of that will be pushed through to sign-off.

Curren believes designers need to think more holistically, because many problems may be systems problems, not merely device problems. “What appears to be the same question may require different answers,” he said. “If you’re looking for reliable operation of your system, you can either make sure that the chip in it is foolproof, or you can put two chips in and make sure you can switch between them or some other solution. So these become problems that can be dealt with at lots of different levels. That opens up the discussion of not just how to make my chip reliable, but is making my chip reliable actually the right thing for the system? Or, is there a different way we should be managing this?”

The goal, as always, is to create the optimal design as quickly as possible with a cost point acceptable to the end customer. “For optimal implementation of any design there has to be an appreciation, from concept to fielded and working system, of every step and limitation to design, implementation, test, verification, lifetime, test, and the list goes on,” said Cadence’s Lafferty. “As design complexity grows the importance of recognizing and designing for all limitations from beginning to end only becomes more challenging. Utilizing all the options in modern software offerings is a critical piece to fielding the best possible system in the least amount of time.”

Verification raises complexity
In the case of verification, complexity can raise the risk of not catching a bug early, which can cause an expensive respin.

“A bug can be completely missed and a defective device sent to the customer,” said Taruna Reddy, staff product manager for the EDA Group at Synopsys. “To avoid those mishaps, verification must be done at several stages in the design process, like RTL, gate-level, and using different technologies like formal, static, and RTL/gate-level simulations. Each of these technologies must be efficient in terms of resource utilization, ease-of-use, and performance to catch as many bugs as possible in the least turnaround time.”

It also may require reminding management that time and money spent on verification is critical to increasing reliability, especially as the stakes grow higher.

“If your phone reboots, it’s annoying,” Curren noted. “If your car reboots in the fast lane of the highway, it’s far more than annoying.”

Reliability must be manageable
Demands for reliability are affected by other issues in addition to complexity. There’s been a change in focus from relatively short-lifetime products, where unreliability is manageable, to long-lifetime products, where it is not. “There’s a lot of work being put into managing safety-critical aspects and reliability in general,” said Curren. “You’re building in self-checking, redundancy, failure mechanisms, verification. There’s a lot of new techniques coming in. But they are difficult, and they add a lot to the unit price. You need to be quite smart about how you use them and where you use them.”

What all of this leads to is yet more design complexity to manage, because of the large amount of data it produces, which in turn must be interpreted.

“Just storing that data is its own novel challenge,” Curren said. “The whole infrastructure problem is relatively new to chips and creates its own risks. Today, making sure that you’ve got things like backups for databases is really hard. There’s a risk management problem people don’t think about. How do you backup a terabyte of data that changes every single night? You can’t back it up on a tape, and you can’t send it over the internet into the cloud. So what do you do with it? These sorts of problems we just didn’t have 10 years ago.”
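A rough back-of-the-envelope calculation shows why moving that much data every night is awkward. The link rates below are assumptions chosen purely for illustration, and the sketch ignores protocol overhead, compression, and incremental backup schemes.

```python
def transfer_hours(size_bytes: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_bytes over a link at a sustained rate, ignoring overhead."""
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000)
    return seconds / 3600


ONE_TERABYTE = 1e12  # bytes, decimal terabyte

# Assumed sustained uplink rates, in Mbit/s (hypothetical values).
hours_at_100_mbit = transfer_hours(ONE_TERABYTE, 100)
hours_at_1_gbit = transfer_hours(ONE_TERABYTE, 1000)
```

Under these assumptions, a sustained 100 Mbit/s link needs roughly 22 hours per terabyte, which leaves almost no margin in a nightly cycle, while a 1 Gbit/s link needs a little over two hours.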

Rigorous communication is key
Also on the required list for managing ASIC design risk is honest, two-way communication, with precise questions and detailed answers. Larger teams, which are common on multidisciplinary projects, especially need meticulously defined workflows.

The design itself should guide the management approach. “Physical structure as well as power domain planning are key pieces of the design to be thoroughly understood at the architecture phase,” Cadence’s Lafferty said. “Every module, power domain, and physical block must be fully understood before the design coding begins. This will naturally divide the design in ways that can be distributed across larger design teams to safely return quality designs on time.”

The details are critical. “This involves mapping out all receivables and deliverables to ensure a smooth and coordinated collaboration among team members,” said Cedric Mayor, CEO of Presto Engineering. “Facilitating a ‘lessons learned’ session is vital for continuous improvement and to enhance workflows and methodologies.”

In addition, at every step of the process, teams must confirm that they have not made different assumptions in their approach to a joint project. It’s a textbook case for most engineers that one of the most expensive disasters in NASA history, the loss of the $125 million Mars Climate Orbiter, occurred because one team had used metric units and the other imperial units.
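One common way to keep that kind of mismatched assumption from crossing a hand-off is to attach the unit to the value rather than passing bare numbers. The sketch below is illustrative only: the `Impulse` type and the unit strings are hypothetical, and impulse is used simply as an example quantity that can be expressed in either system.

```python
from dataclasses import dataclass
from typing import Iterable

# 1 pound-force is defined as 4.4482216152605 newtons.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str  # "N*s" (metric) or "lbf*s" (imperial)

    def to_newton_seconds(self) -> float:
        # Convert explicitly; refuse to guess when the unit is not recognized.
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown unit: {self.unit}")


def total_impulse_newton_seconds(parts: Iterable[Impulse]) -> float:
    """Sum contributions from different teams after normalizing to one unit."""
    return sum(part.to_newton_seconds() for part in parts)


# Two teams report the same physical quantity in different units.
total = total_impulse_newton_seconds([Impulse(10.0, "N*s"), Impulse(10.0, "lbf*s")])
```

Without the unit tag, the two values of 10.0 would simply be added, and the error would be silent. With it, an unrecognized unit fails loudly at the interface.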

“Managing the data hand-off between geographically distributed teams is key so there are no wasted cycles in the effort completed by different teams,” Synopsys’ Reddy said. “This can be managed by communication, smart collaboration utilities built into the EDA products, and revision control.”

In this sense, reliability doesn’t just apply to hardware. There must be professional partners who are reliable collaborators, too.

“Every chip is going to have a different process, different libraries, different clocks, different ways of writing the specifications, different languages and different interfaces. It’s all going to be hugely challenging,” said Curren. “A way of limiting the risk is to make sure you have a good relationship with your supplier — one who is able to work very closely with you. And when you feel that you can share your own aspirations and concerns and business objectives in an open way. I think where things go wrong is where communication is poor, or where things get hidden or glossed over.”

To find a good partner, he recommends asking companies such as TSMC or Arm, which have very recognizable brands and have no reason to be biased, who they think is good. “After that, a lot of it really just comes down to starting to work with the team and establishing that relationship and checking that you can have open and honest conversations. Do they understand your business? Do they understand your business objectives? Or are they just trying to sell you what they’ve got? It has to be a really open conversation because sometimes there’s stuff in there that people don’t like to think about, but which can bite you later on. Nobody ever likes in the sales cycle to mention the word ‘failure.’”

Designing chips has always been complicated, but there are so many moving pieces today that the various steps and processes need to be thought through much more carefully. “To ensure the success of a complex ASIC project, the initial stage involves employing a top-down approach to identify the specific use cases where the ASIC will be utilized,” Presto’s Mayor said. “By thoroughly comprehending the requirements of each use case, it becomes possible to accurately extract the necessary feature set required to fulfill those use cases. Once the feature set has been determined, the next step involves specifying the physical blocks that need to be implemented. The subsequent step involves meticulous planning and execution. In this regard, selecting an ASIC partner with a well-established and validated development flow is of utmost importance.”

Conclusion
In complex ASIC design, all of the considerations add up to further enticements to shift left, Siemens’ White said. “Classically, you’re only running the global sign-off toward the end of each step. But if you run those solutions earlier, as well as shorten how long it takes to run them, you’re finding errors early in the process versus finding them at the end. This de-risks from a schedule perspective, because you’re finding out earlier if your design is manufacturable or not.”

Nevertheless, time must be used smartly. “Do not shift too far left, too early,” White warned. “You don’t necessarily want to run a 10,000-check DRC deck on everything because the design in these early stages is not complete. Running the entire deck can take many hours and produce results that aren’t actionable errors.”
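White’s caution about running a full design rule check deck too early can be pictured as tagging each rule with the earliest stage at which its results are actionable. The following is a minimal sketch under stated assumptions: the stage names, rule names, and `Rule` type are hypothetical and are not taken from any foundry deck or EDA tool.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical design stages, ordered from earliest to latest.
STAGES = ["block", "floorplan", "full_chip"]


@dataclass(frozen=True)
class Rule:
    name: str
    earliest_stage: str  # first stage at which this rule gives actionable results


def rules_for_stage(deck: List[Rule], stage: str) -> List[Rule]:
    """Select only the rules that are meaningful at or before the given stage."""
    limit = STAGES.index(stage)
    return [rule for rule in deck if STAGES.index(rule.earliest_stage) <= limit]


DECK = [
    Rule("min_width", "block"),
    Rule("min_spacing", "block"),
    Rule("power_grid_connectivity", "floorplan"),
    Rule("metal_density", "full_chip"),
]

# Early in the flow, only the block-level subset is run.
early_subset = [rule.name for rule in rules_for_stage(DECK, "block")]
```

The subset grows as the design matures, so the full deck still runs at final sign-off while early runs stay short and produce only actionable results.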


