Seow Lim, Darin Clemmer, Gerald Kuntsman, and Swanand Mhalagi contributed to the development of this article.

Artificial Intelligence (AI) has been a hot topic, especially since the introduction of end-user applications such as ChatGPT. What has received less attention is how AI applications perform on various hardware configurations. Benchmarking is the best way to determine the most efficient hardware for different AI use cases.

This article reviews benchmark results of AI-optimized Intel CPUs hosted on phoenixNAP Bare Metal Cloud. We consider three popular AI use cases: image recognition, natural language processing (NLP), and recommendation engines.

Benchmark Methodology

Four AI models were selected for benchmarking, covering three distinct AI use cases: ResNet50 (image classification) and SSD-ResNet34 (object detection), BERT (natural language processing), and DIEN (recommendation).

The applications were run at different precisions: FP32 (32-bit floating point), INT8 (8-bit integer), and BF16 (bfloat16). These data types represent numbers at high, low, and intermediate precision, respectively.

The choice of precision comes down to the trade-off between computational accuracy and computational efficiency. FP32 delivers higher accuracy but requires more computing resources and longer computation times, which may not be practical for certain applications. INT8 offers faster computation times and requires fewer computing resources, making it ideal for real-time applications where speed and efficiency are crucial. BF16 sits between the two, trading a small amount of accuracy for substantially faster computation.

SPR (Sapphire Rapids, the 4th Gen Xeon SP architecture) supports BF16 precision in addition to the FP32 and INT8 available on ICX (Ice Lake, the 3rd Gen Xeon SP architecture). SPR also adds a newer instruction set, AMX, which provides added benefit for inferencing.
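
To make the precision trade-off concrete, here is a minimal Python sketch using TensorFlow (the framework the benchmarks below are built on). The rounded value and the [-3, 3] quantization range are illustrative assumptions, not values from our tests.

```python
import numpy as np
import tensorflow as tf

x = np.float32(3.14159265)

# BF16 keeps FP32's dynamic range but only ~8 bits of mantissa,
# so values are rounded more coarsely than in FP32.
print(tf.cast(x, tf.bfloat16).numpy())  # ~3.140625

# INT8 cannot hold real numbers directly; values are quantized onto
# 256 integer levels over an assumed activation range (here [-3, 3]).
scale = 6.0 / 255
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
print(q, q * scale)  # the dequantized value shows the precision (and range) lost
```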

Benchmarked AI Applications

List of AI Program Modes for Testing

  • Real-time performance typically requires low-latency, high-speed processing capabilities. This means the hardware must be able to process data quickly and respond to input in real time. For example, in the case of ImageNet, the AI system must process and run inference on images in real time, as in a video streaming scenario. Therefore, real-time AI applications often require high-end hardware, such as GPUs or specialized AI chips, with low-latency memory and high-speed I/O interfaces.
  • Maximum throughput requires hardware that can process large amounts of data efficiently. This typically involves parallel processing capabilities and high memory bandwidth to handle large datasets. The sketch after this list illustrates how batch size separates the two modes.
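
To illustrate the difference between the two modes, the following Python sketch times a toy TensorFlow model at batch size 1 (real-time) versus batch size 50 (the maximum-throughput batch size used for DIEN later in this article). The model is a stand-in, not one of the benchmarked networks.

```python
import time
import numpy as np
import tensorflow as tf

# A tiny stand-in model; the real benchmarks used ResNet50, SSD-ResNet34, BERT, and DIEN.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.build((None, 512))

def measure(batch_size, iters=100):
    data = np.random.rand(batch_size, 512).astype(np.float32)
    model(data, training=False)  # warm-up run
    start = time.perf_counter()
    for _ in range(iters):
        model(data, training=False)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size}: {elapsed / iters * 1000:.2f} ms/step, "
          f"{batch_size * iters / elapsed:.0f} samples/s")

measure(1)   # real-time mode: per-request latency is what matters
measure(50)  # maximum-throughput mode: samples/s is what matters
```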

List of AI Hardware Acceleration Features

  • AMX (Advanced Matrix Extensions) is a set of instructions that are part of Intel's 4th Gen Xeon architecture. AMX is designed to accelerate the processing of matrix operations that are commonly used in neural networks, such as convolutional layers. AMX can perform matrix multiplication and accumulation operations with low power consumption and high efficiency.
  • VNNI (Vector Neural Network Instructions) is a set of instructions that is part of the Intel AVX-512 instruction set architecture. VNNI is designed to accelerate convolutional neural networks (CNNs) by speeding up the inner-product operations they rely on. VNNI can perform multiple vector dot-product operations in a single instruction, which improves the performance of neural network computations. A quick way to confirm that a host exposes these features is shown after this list.
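
Before benchmarking, it is worth confirming that the host actually exposes these instruction sets. On Linux, the kernel reports them as CPU feature flags, which a few lines of Python can check (the flag names below are the standard /proc/cpuinfo identifiers):

```python
# Check /proc/cpuinfo for the AVX-512 VNNI and AMX feature flags (Linux only).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in ("avx512_vnni", "amx_tile", "amx_int8", "amx_bf16"):
    print(f"{feature}: {'present' if feature in flags else 'absent'}")
```

On the 3rd Gen Xeon system only avx512_vnni should be present; the 4th Gen system should report all four flags.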

List of Precisions and Optimization Libraries

  • FP32 (32-bit floating point): Represents real numbers with higher precision but requires more memory and computation than INT8. Models that use FP32 can achieve higher accuracy but need more computing resources and longer computation times, which can make training and inference impractical for certain applications.
  • INT8 with Intel Neural Compressor: Represents values as 8-bit integers, requiring less memory and computation than FP32. INT8 models achieve faster computation times with fewer computing resources, making them ideal for real-time deployments where speed and efficiency are crucial. However, the lower precision can reduce model accuracy, which may not be acceptable for some applications (a quantization sketch follows this list).
  • BF16 (bfloat16): Specific to SPR, in addition to the FP32 and INT8 available on ICX. Combined with the AMX instruction set that SPR introduces, BF16 provides added benefit for inferencing.
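
As an illustration of the INT8 path, here is a minimal post-training quantization sketch using Intel Neural Compressor's 2.x API. The model path and calibration dataloader are placeholders to substitute with your own; this is a sketch of the workflow, not the exact setup used for these benchmarks.

```python
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Placeholders (assumptions, not artifacts of this article): a saved FP32
# TensorFlow model and a dataloader yielding representative calibration inputs.
fp32_model_path = "resnet50_fp32_saved_model"
calib_dataloader = ...  # e.g. a neural_compressor DataLoader over sample images

# Post-training static quantization: INC calibrates activation ranges on the
# sample data, then emits an INT8 model that can run on VNNI/AMX kernels.
q_model = fit(model=fp32_model_path,
              conf=PostTrainingQuantConfig(),
              calib_dataloader=calib_dataloader)
q_model.save("resnet50_int8")
```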

Benchmarking Intel CPUs for AI

Intel has been actively contributing to the development of artificial intelligence (AI) technology. Most notably, Intel CPUs have been used for facilitating machine learning, natural language processing (NLP), and deep learning.

To further improve the performance of their line-up, Intel has released CPUs specifically designed for AI, such as Intel Xeon Scalable Processors and the Intel Xeon Phi Coprocessor.

The following are the servers used in benchmarking:

3rd Gen Xeon SP System Configuration

  • phoenixNAP Bare Metal Cloud Instance Type: d2.c5.medium
  • Dual Socket 8352Y 3rd Gen Xeon SP
  • 256GB DDR4 2933MT/s
  • 2x INTEL 2TB P4510 NVMe
  • Operating System: CentOS Stream 8
  • Kernel: 6.2.2-1.el8.elrepo.x86_64

4th Gen Xeon SP System Configuration

  • phoenixNAP Bare Metal Cloud Instance Type: d3.m6.xlarge
  • Dual Socket 8452Y 4th Gen Xeon SP
  • 256GB DDR5 4800MT/s
  • 2x INTEL 2TB P4510 NVMe
  • Operating System: CentOS Stream 8
  • Kernel: 6.2.2-1.el8.elrepo.x86_64

NLP programs involve processing large amounts of unstructured textual data, requiring high processing power and memory. Hardware with high-end CPUs, such as the Intel Xeon Scalable processors, can provide the necessary processing power to handle complex NLP tasks.

Recommendation engines require more memory and storage than processing power, as they involve accessing large datasets to make recommendations. Hardware with large memory capacity and storage, such as Intel Optane DC Persistent Memory, can provide the necessary resources to handle recommendation engine tasks efficiently.

In contrast, image recognition AI programs require high computational power and processing speed to process large amounts of image data. Hardware with high-performance GPUs, such as the Intel Xe GPU, provides the required processing power to handle complex image recognition tasks.

Figure 1: The benchmarked hardware & software stack

AI Inferencing Performance on Intel Hardware/Software Stack

The method used to evaluate AI inferencing performance on the benchmarked Intel hardware/software stack involved the following steps:

  • Identify the AI program to be benchmarked and the specific hardware configurations to be tested.
  • Select a standard benchmarking tool, such as TensorFlow or PyTorch, to evaluate the AI program's performance on each hardware configuration.
  • Install and configure the benchmarking tool on each hardware configuration.
  • Run the benchmarking tool and record the results for each hardware configuration, including metrics such as processing speed, memory usage, and accuracy.
  • Compare the results to identify the most efficient and effective hardware configuration for the specific AI program.
  • Repeat the benchmarking process with different benchmarking tools or configurations to validate the results, if necessary.

By following this method, organizations can determine the optimal hardware configuration for their AI applications, maximizing performance and minimizing costs. Benchmarking AI programs on different Intel CPUs is a critical step in supporting complex and data-driven decision-making processes.

To run the benchmarks on Gen 4 Sapphire Rapids, the methodology was slightly different, as SPR supports BF16 in addition to FP32 and INT8. Enabling all available hardware optimizations requires the Intel-optimized build of TensorFlow plus additional software patches.
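
As one example of the configuration involved, the oneDNN backend behind Intel-optimized TensorFlow can be steered with standard environment variables. The sketch below uses documented TensorFlow/oneDNN controls, though the exact patches and settings used for these benchmarks may differ; the variables must be set before TensorFlow is imported.

```python
import os

# Enable oneDNN graph optimizations in TensorFlow.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
# Allow AMX kernels on 4th Gen Xeon (Sapphire Rapids).
os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX512_CORE_AMX"
# For an AMX-vs-VNNI comparison, cap the ISA at the VNNI level instead:
# os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX512_CORE_VNNI"

import tensorflow as tf  # import after the environment is configured
print(tf.__version__)
```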

The results of each test may vary by up to 5%. To ensure consistent results across all test cases, each scenario was run up to 4 times.

BF32 was skipped, as very few models support that precision at the time of writing.

Methodology and Results

The methodology used for benchmarking involves measuring the overall throughput of various models. In the case of the BERT-Large and DIEN models, throughput is calculated by determining how many input sequences, or examples, can be processed within a given time frame.

For the ResNet50 model, throughput is calculated by measuring how many images can be processed per second, as reported in the benchmark output. The benchmark tests compared the performance impact of the following variables: CPU generation, instruction set (VNNI vs. AMX), use case (maximum throughput vs. real time), and precision (FP32, BF16, and INT8).
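
In both cases, the reported metric reduces to a simple ratio; a trivial helper makes it explicit (the numbers in the usage line are illustrative, not measured results):

```python
def throughput(samples_processed: int, elapsed_seconds: float) -> float:
    """Examples (BERT/DIEN) or images (ResNet50) processed per second."""
    return samples_processed / elapsed_seconds

print(throughput(5000, 12.5))  # 400.0 samples/s (illustrative numbers)
```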

The results of our benchmarking are shown in the table below:

Figure 2: Performance of Gen3 (8352y) vs Gen 4 (8452y) Xeon Scalable CPUs in real time
Figure 3: Performance of Gen3 (8352y) vs Gen 4 (8452y) Xeon Scalable CPUs max throughput

Figure 2 and Figure 3 compare the performance between Gen 3 and Gen 4 Intel Xeon Scalable CPUs under the same criteria (FP32 and VNNI). The results clearly indicate that upgrading from a Gen 3 to a Gen 4 CPU yields a performance increase of 100% to 200%, with the most significant gains seen in DIEN, a recommendation engine model.

DIEN has significant computational requirements, particularly when processing large datasets or using complex model architectures. Higher computational power can lead to faster processing of user behavior sequences and item interactions, resulting in more accurate and timely recommendations. While the benefits of higher computational power are also evident in models like ImageNet and BERT, the performance increase is the most significant in recommendation models like DIEN.

Furthermore, we observed that maximum-throughput mode benefits more from the Gen 4 CPU than real-time mode does across all models, with a batch size of 50 providing almost a 275% performance increase for the DIEN model. This is because real-time mode is optimized for low latency and uses a small batch size, resulting in lower overall CPU utilization. In contrast, maximum-throughput mode is optimized for high throughput and uses a larger batch size, which increases overall CPU utilization but leads to faster processing of the whole workload.

Figure 4: ImageNet Xeon Scalable CPUs throughput performance comparison in different precisions
Figure 5: BERT-Large Xeon Scalable CPUs throughput performance comparison using different precisions

Figure 4 and Figure 5 analyze the impact of different data precisions on the performance of various models. The results demonstrate that INT8 (8-bit integer) inference provides significantly higher performance than FP32 (32-bit floating point) or BF16 (bfloat16) for image processing models, as exemplified by ResNet50, an image classification model.

On a Gen 3 Intel Xeon Scalable CPU, we observed a 6x performance improvement for INT8 over FP32, and on a Gen 4 Xeon Scalable CPU, we observed a 4x increase. We also found a 3.5x performance increase for an NLP model. These results demonstrate that the INT8 data type can provide better performance than FP32 due to its ability to leverage hardware accelerators and more efficiently utilize the CPU.

Figure 6: ImageNet throughput performance comparison across instruction sets using FP32
Figure 7: ImageNet throughput performance comparison across instruction sets using BF16 precision

Through our testing, we determined that the instruction set has no discernible impact on overall performance at FP32 precision for both the image and recommendation engine models, as shown in Figure 6; AMX only benefits the INT8 and BF16 precisions. With BF16 precision, for example, AMX delivers a 4.5x performance improvement over VNNI (Figure 7).
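
For readers who want to exercise the BF16 path themselves, a minimal Keras-level sketch follows. It assumes a recent Intel-optimized TensorFlow build on a 4th Gen Xeon; it is illustrative rather than our exact harness.

```python
import tensorflow as tf

# Run compute in bfloat16; on 4th Gen Xeon SP this lets oneDNN dispatch
# AMX BF16 kernels instead of the FP32 AVX-512 path.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None)  # random weights, smoke test only
x = tf.random.uniform((50, 224, 224, 3))              # an illustrative batch of 50 images
y = model(x, training=False)
print(y.shape)  # (50, 1000) class scores
```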

Conclusion

In this research, we identified that the choice of CPU plays a crucial role in achieving optimal performance for AI applications. The results demonstrate that Intel CPUs are well-suited for AI workloads, with Gen 4 Xeon Scalable CPUs providing a performance boost for all AI models tested. This is especially true for recommendation engines running in maximum-throughput mode.

Furthermore, our findings demonstrate that different precisions can have a significant impact on performance, with INT8 providing better performance than FP32 for image processing and NLP models.

In conclusion, our benchmarking methodology offers a comprehensive approach for evaluating and optimizing AI program performance on different Intel CPUs, which can help organizations make informed decisions regarding their AI hardware configurations.