Challenges For New AI Processor Architectures

Getting an AI seat in the data center is attracting a lot of investment, but there are huge headwinds.

Investment money is flooding into the development of new AI processors for the data center, but the problems here are unique, the results are unpredictable, and the competition has deep pockets and very sticky products.

The biggest issue may be insufficient data about the end market. When designing a new AI processor, every design team has to answer one fundamental question: how much flexibility is required in the product? Should it be optimized for a single task or for more general workloads? A continuum of solutions exists between the two endpoints, and getting that balance right for AI hardware is harder than it has been for many solution spaces of the past, especially for data center workloads.

Many factors need to be balanced. “It comes down to being able to drive the economics of designing and manufacturing devices,” says Stelios Diamantidis, director of AI products and research at Synopsys. “That has to be done within the time and the cost you have available.”

This immediately starts to narrow the potential markets. “When do you make money off a custom chip? When there is volume,” says Susheel Tadikonda, vice president of engineering at Synopsys’ Verification Group. “If I’m trying to build a custom chip for data centers, how many data centers are there? You may be able to sell the chip for a premium, but that’s not enough. If I were to build a chip for a consumer device, we could be talking about billions of devices. The volume is there. That’s where you’ll see a lot more money being made from these ASICs, because they definitely need volume. Otherwise it doesn’t cut it at all.”

But that does not address the issue of where the chip may fit in the custom-to-full-programmable continuum. “As you get more customized, as you create a chip for a very specific algorithm, it will be more energy-efficient and it will have better performance,” says Anoop Saha, senior manager for strategy and business development at Siemens EDA. “But you have sacrificed volume. It also reduces the lifetime of that chip. If you have a new algorithm two years down the line, is that chip still as valuable as it was? It’s an interplay of a lot of things. Some algorithms at the edge do tend to settle. The industry finds an optimal algorithm after a few years of research, and that optimal algorithm will work for a lot of cases. We have seen that with the CNN (convolutional neural network), we have seen that for wake-word detection, handwriting recognition — you need to find an optimal algorithm for a certain specific use case.”

Defining the workload
Customization starts with understanding exactly what the workload looks like, and that gives advantages to certain players. “Most of the hyperscalers have formed their own chip units, and they’re creating chips for very high coverage workloads in their data centers,” says Nick Ni, director of product marketing for AI and software at Xilinx. “For example, if Google sees that ‘recommendation’ type of neural network as one of the highest workloads in the data center, it makes sense for them to create a dedicated chip and card for that. If the number two workload is speech recognition, it makes sense to do that. If the number three is the video transcoding for YouTube, they’ll do that. There are loads of opportunities, but not everybody is Google. Several years ago, Google published a well-received paper where they were showing the diversity of workloads in their data center, and it was very diverse. No single workload was taking more than 10%. That means there is a huge number of long-tail workloads that need to be optimized.”

Most customization is for inference. “When it comes to training, you need floating point support,” says Synopsys’ Diamantidis. “They need backward propagation of weights and the big software environment around it. But if you consider a solution that is 100% applied toward inference, it has fixed point, probably eight bits or even lower precision. The care-abouts are different. If the model is fixed, does it make sense to actually have flavors within the inference infrastructure itself, meaning somewhat customized solutions for voice, somewhat customized solutions for video processing, for example, for some of the heavy hitter applications? The hyperscalers are investing toward silicon solutions for inference that are more customized for their high-level models and solutions in the AI space. But where you’re trying to run a large number of different kinds of applications, there’s probably more need for configurability and flexibility.”

There is a virtuous circle here. “The TPU was built to cater to the particular workloads within the Google data center,” says Synopsys’ Tadikonda. “It was originally created because they realized that if they have to crunch this amount of data, they would need to build so many data centers to handle that data complexity and computation. That’s where the law of economics made them build the TPU. The first TPU was a hunk. It was extremely power consuming and big. But they have refined it. They have learned and learned. They can do it because that’s their job. That’s what Google is.”

Not everyone has the feedback loop available to Google, but other companies do have options. “One of the key components that we see is the focus and emphasis on getting the right architecture choices early on,” says Siemens’ Saha. “It’s not about what somebody thinks would be right. It’s not about making an intuitive decision based on past performance, because there are so many unknowns right now. What the industry is doing is to make data-driven decisions early in the design cycle, so you have the ability to quickly change if you see something not working.”

Those decisions can be at the macro level or at a more detailed level. “How close do you have your memory elements to the compute elements?” asks Saha. “How often do the memory fetches happen? Reads and writes will have a direct impact on the overall energy efficiency. The industry is looking at new architectures, so nobody knows exactly what will work. You need to be malleable, but you need to make sure you have enough data before you make those decisions.”
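
Saha’s point about memory placement lends itself to a back-of-envelope model. The Python sketch below tallies MAC operations and memory accesses for a hypothetical layer under two dataflows. The per-operation energy numbers and the access counts are illustrative assumptions for the purpose of the sketch, not measured silicon data.

```python
# Back-of-envelope energy model for one layer of a neural network.
# The per-operation energy numbers below are illustrative placeholders,
# not vendor data; substitute figures for your own process node.

ENERGY_PJ = {
    "mac_int8": 0.2,      # assumed energy per 8-bit MAC, in picojoules
    "sram_read": 5.0,     # assumed energy per on-chip SRAM access
    "dram_read": 640.0,   # assumed energy per off-chip DRAM access
}

def layer_energy_uj(macs, sram_accesses, dram_accesses):
    """Rough energy estimate (microjoules) for a given access breakdown."""
    pj = (macs * ENERGY_PJ["mac_int8"]
          + sram_accesses * ENERGY_PJ["sram_read"]
          + dram_accesses * ENERGY_PJ["dram_read"])
    return pj / 1e6

# Same compute, two dataflows: one keeps re-fetching weights from DRAM,
# the other keeps them resident in local SRAM close to the compute.
macs = 100e6
print(layer_energy_uj(macs, sram_accesses=10e6, dram_accesses=5e6))    # DRAM-heavy
print(layer_energy_uj(macs, sram_accesses=15e6, dram_accesses=0.1e6))  # SRAM-resident
```

Even with crude numbers, the point survives: off-chip accesses dominate, which is why memory placement is an early architectural decision rather than a late optimization.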

Hardware and algorithm churn
Another factor that influences where you land on the continuum is how fast the hardware needs to evolve and how quickly the algorithms evolve. That determines the time that the data center owner has to make money from hardware they buy, and it establishes the price they are willing to pay. It also caps the total cost for development of the chips.

What is the lifetime of a chip in the data center? “Typically, a chip or board will stay for three or four years,” says Xilinx’s Ni. “Some of the more aggressive data centers may upgrade around that timeframe, some may go for longer. For artificial intelligence, we can follow the Google TPU announcements. In the last six years or so, they have had four versions of TPU. So that’s like every year or two they’re swapping their internal hardware to optimize for these fast-changing workloads like AI.”

Looked at differently, there is an opportunity to get into the data center perhaps every 18 months. “It’s not easy to disrupt that market,” says Saha. “There are two parts to it — how often they replace their existing data center chips, and how often they add new things. I see almost all the data centers try out newer things. Almost everybody who is building a data center chip is working with some partner or some end customer. How often do they replace existing stuff, or things that are working? They will try to maximize the life of the chip as long as it’s working. Once you have gotten into a data center, it’s a long decision and it’s very hard to replace. That’s why you see so many investments into these large data center chips. There is a bet from certain sections of the investment community that it is a winner-takes-all market, or that there will be one or two or three winners that will grab a lion’s share. Once they get in, it will be very hard to replace them.”

Designing for the future
What you start designing today has to meet the demands in about 18 months. “When we decide to harden blocks within our chips, we also have to optimize for certain precision,” says Xilinx’s Ni. “For example, we made certain choices around integer eight. We had to bet a little bit that by the time this product becomes mainstream, 8-bit is still mainstream. We also ensured that we could deal with a mixed-precision network, where half of it is 8-bit, another quarter is 4-bit, another quarter is 1-bit. For that, we implement the 8-bit part in the AI engine, which runs basic performance very fast, and then you can implement the 4-bit and the 1-bit MAC units in the FPGA fabric.”
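
To illustrate the trade-off Ni describes, the toy Python sketch below quantizes a random weight tensor at 8, 4, and 1 bits, plus the half/quarter/quarter mix he mentions, and compares the resulting error. It is a numerical illustration only, not a model of the Xilinx AI engine or FPGA fabric.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit width.
    1-bit is treated as sign-only (scaled binary weights)."""
    if bits == 1:
        return np.sign(x) * np.mean(np.abs(x))
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)

# A crude "mixed-precision network": half the weights at 8-bit,
# a quarter at 4-bit, a quarter at 1-bit.
parts = np.split(weights, [5_000, 7_500])
mixed = np.concatenate([quantize(parts[0], 8),
                        quantize(parts[1], 4),
                        quantize(parts[2], 1)])

for bits in (8, 4, 1):
    err = np.mean((weights - quantize(weights, bits)) ** 2)
    print(f"{bits}-bit MSE: {err:.2e}")
print(f"mixed MSE : {np.mean((weights - mixed) ** 2):.2e}")
```

The hardware bet is that the error introduced by the lower-precision portions stays tolerable for the networks that will be mainstream when the silicon ships.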

Design times and algorithm evolution are of the same magnitude. “In 18 months, the application may very well be different,” warns Tadikonda. “I don’t think data scientists today are going to guarantee to anyone that they’re going to be running the same model in 18 months that they’re running today.”

There are many decisions that have to be made. “Quantization will probably be the single biggest factor in a lot of energy efficiency metrics,” says Saha. “Quantization will have more impact on the inference side, which is spread across both the data center and the edge, but there is an aspect of quantization on the learning side, as well. Whenever you quantize to a lower number of bits, you are trading accuracy for energy efficiency. It’s more efficient, but it’s not as accurate. In training you might require floating point, but there are newer floating-point formats. When Google designed its next-generation TPU, they created bfloat16, or ‘brain floating point,’ for training. It’s very different from IEEE float, and it gives the benefits of floating point in accuracy, but also has a significant energy-efficiency benefit.”
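
As a minimal sketch of the format Saha refers to: bfloat16 keeps float32’s sign bit and full 8-bit exponent but only 7 of its 23 mantissa bits, so a truncation-only conversion simply drops the low 16 bits of the float32 pattern. Real hardware typically rounds to nearest even rather than truncating; the Python below is only meant to show the layout and the range-versus-precision trade.

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 values to bfloat16: keep the sign bit, the full
    8-bit exponent, and only 7 of the 23 mantissa bits (no rounding)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 1.0e38], dtype=np.float32)
print(to_bfloat16(x))   # pi becomes 3.140625; 1e38 keeps its magnitude,
                        # whereas IEEE fp16 (max ~6.5e4) would overflow to inf
```

The coarser mantissa is what saves multiplier area and energy, while the full-width exponent preserves the dynamic range training needs.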

This can make the economics difficult. “For an ASIC of that size, for that much effort and with such rapid change, only a few companies can justify the economics,” says Tadikonda. “Algorithms are changing because the use cases on this data are increasing. What you thought was efficient today is not efficient tomorrow. To catch up and be on the cutting edge, you have to keep innovating or reinventing these ASICs. Google has an advantage. The reason they’re able to churn so fast is because they have so much data. They are learning a lot from their TPUs, and they know what they need to change to make their applications run better. If I am a third-party silicon developer, I do not have that data. I have to rely on my customers to provide that, so that turnaround loop would be longer. Google is in a very unique situation.”

It puts pressure on verification, as well. “Verification of the floating-point hardware will be crucial to meet the performance and power requirements for these chips,” says Rob van Blommestein, head of marketing at OneSpin. “Verification of floating-point hardware designs has long been considered a significant challenge. Floating-point units (FPUs) combine the mathematical complexity of floating-point arithmetic with a wide range of special cases that require complex control paths. What’s needed is a formal verification solution that verifies the result of arithmetic operations, as computed by the hardware FPU, accurately matches the IEEE 754 standard specification.”
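
Formal tools prove that kind of equivalence exhaustively; the Python sketch below only illustrates the specification side, generating IEEE 754 reference results (via float32 arithmetic, which rounds to nearest even) for a few directed corner cases that a hardware model, here the hypothetical dut_fp32_mul, would have to match. NaN payloads are implementation-defined, so a real checker compares NaN-ness rather than raw bits for those cases.

```python
import numpy as np

def ref_fp32_mul(a_bits, b_bits):
    """IEEE 754 reference: multiply two float32 values given as raw 32-bit
    patterns and return the result's bit pattern (round-to-nearest-even)."""
    a = np.array([a_bits], dtype=np.uint32).view(np.float32)
    b = np.array([b_bits], dtype=np.uint32).view(np.float32)
    with np.errstate(invalid="ignore"):          # inf * 0 legitimately yields NaN
        return int((a * b).view(np.uint32)[0])

# Directed corner cases any FPU proof must also cover:
cases = [
    (0x7FC00000, 0x3F800000),  # quiet NaN * 1.0   -> NaN propagates
    (0x7F800000, 0x00000000),  # +inf * +0.0       -> NaN
    (0x80000000, 0x3F800000),  # -0.0 * 1.0        -> -0.0 (sign preserved)
    (0x00000001, 0x3F800000),  # smallest subnormal * 1.0
]
for a, b in cases:
    # dut_fp32_mul(a, b) would be the hardware model under test (hypothetical);
    # its output is compared against the reference result printed here.
    print(f"{a:08X} * {b:08X} -> {ref_fp32_mul(a, b):08X}")
```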

Conclusion
It has often been said that data is the new oil, and AI is one area where that connection is becoming obvious. An architect can only visualize so much. They need access to the data that helps them to refine or build better products. That is why data center processors are so sticky. Once you are there, you have access to the data you need to stay there.

The only other way in is to speed up the design process enough to change the economics. And ironically, AI is the only disrupter that has shown the potential for those kinds of leaps in productivity.
