Taking Sapphire Rapids For A Spin To Experience Intel's 4th-Gen Xeon Accelerators Firsthand
It wasn't all that long ago that we had our first experience with Intel's 4th-generation Scalable Xeons, code-named Sapphire Rapids, and checked out some live benchmarks. In the demonstration, Intel put its pre-release silicon up against the current best from AMD, a server with a pair of Milan-based EPYC 7763 64-core CPUs, and the blue team's various acceleration technologies secured win after win in some common workloads. The Intel server had a pair of 60-core processors with Quick Assist Technology (QAT) acceleration and a full terabyte of RAM. This monster machine is the kind of thing Intel expects to see warehoused in data centers pushing the envelope on performance in various cloud computing and server workloads.
Not satisfied to just show these off in a controlled environment, Intel offered us the chance to put our hands on one of these beastly servers ourselves. When FedEx delivered the 2U rack-mountable chassis stuffed with the same configuration as our in-person demo, we were able to recreate Intel's data center testing experience in our own lab, complete with tons of horsepower, screaming fans, and some noise-canceling headphones to keep us company.
Unfortunately, we can't disclose much more detail about this system; given its preproduction status, Intel is still not quite ready to announce model numbers, cache configurations, core speeds, and so on. We're certainly going to honor that request, but don't fret; all will be revealed in due time. The good news is that there were also some tests we could run that weren't part of our initial demo, and Intel has allowed us to share those results with you here as well.
Setting Up An Intel Sapphire Rapids Xeon Server
We've seen some leaks, and we've seen these CPUs in person, but the real proof points lie in running benchmarks for ourselves. The fun thing about Intel's hands-on experience is that the server we received had no operating system. As if to show that there's no magic software configuration, the company provided step-by-step instructions to replicate its results for ourselves. We got to install Ubuntu 22.04 and CentOS Stream 8, clone publicly-available Git repos, and execute all the tests we saw in person in September at Intel Innovation for ourselves. While installing an OS from scratch isn't the most glamorous task, we know our software setup will be the same publicly-available stack that the rest of the world uses. That means the benchmark results that follow should be indicative of what Intel expects performance to look like in real-world deployments.
We started with the latest long-term support (LTS) version of Ubuntu Server, 22.04, and aside from enabling SSH and installing drivers for the pair of 100-Gigabit Ethernet cards, there was no extra configuration. Some of the tests that Intel demonstrated require additional client hardware, and the clients were beefier than what we have on hand.
As such, Intel gave us remote access to a client/server pair, and we were able to first confirm the hardware and software installed on each and then replicate those tests remotely. It's not quite the same as running them in our office, but we were comfortable with this compromise. So the NGINX, SPDK, and IPSec tests you'll see shortly were performed in that fashion.
As part of the benchmark setup procedure, Intel provided purpose-built scripts, step-by-step instructions, and recommended BIOS settings for each test. In between tests, we installed the requisite software packages, configured the BIOS as needed, and rebooted the system. Each test was run three times, and we took the median result of each task for our graphed results below. Our test system's configuration was identical to the one Intel demonstrated in September, so the results should match, if everything goes according to expectations.
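The median-of-three protocol is simple enough to sketch in a few lines of Python. The scores here are made-up placeholders; the real numbers came from Intel's benchmark scripts:

```python
from statistics import median

# Hypothetical per-run scores for a single test (illustrative values only).
runs = [98.7, 101.2, 100.4]

# Three passes per test; taking the median discards a single outlier run
# in either direction, which a mean would not.
result = median(runs)
print(result)  # 100.4
```

Using the median rather than the mean is a common choice for small run counts, since one run disturbed by background activity can't drag the reported figure with it.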
For reference, our system's specifications are as follows: 2x pre-production 4th Gen Intel Xeon Scalable processors (60 cores) with Intel Advanced Matrix Extensions (Intel AMX), on a pre-production Intel platform and software with 1024GB of DDR5 memory (16x64GB), microcode 0xf000380, HT On, Turbo On, SNC Off. BIOS settings depended mostly on whether virtualization was required for each task, so it was enabled and disabled as necessary.
New Benchmarks For Intel Sapphire Rapids 4th Gen Xeon
Before we get into a few remotely-performed tests, there are a couple of workloads we haven't seen Sapphire Rapids run yet, so let's start with those. First up is LINPACK, which we tested using Intel's optimized version based on its oneAPI math libraries. We were able to test with both an AVX2 codepath as well as AVX512.
As you can see, the AVX512 version is just about 90% faster than AVX2. Not that 3.5 teraflops with AVX2 is pokey or anything, but with AVX512 extensions the platform was able to complete the same workload in 55% of the time. We all know that AVX512 is fast when a task can take advantage of it, and this is certainly one of those.
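For a fixed workload, runtime scales inversely with throughput, which is how the "about 90% faster" figure and the "55% of the time" figure relate; a true 1.9x speedup would land closer to 53% of the original runtime, so the measured speedup was presumably a touch under 1.9x. A quick sketch of the conversion, using the figures above:

```python
avx2_tflops = 3.5    # AVX2 result quoted above
speedup = 1.9        # "about 90% faster" with AVX512
avx512_tflops = avx2_tflops * speedup

# For a fixed problem size, time is inversely proportional to throughput.
time_fraction = 1 / speedup
print(f"AVX512: ~{avx512_tflops:.2f} TFLOPS")         # ~6.65 TFLOPS
print(f"Runtime: ~{time_fraction:.0%} of AVX2 time")  # ~53%
```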
Next up, we also ran a couple of molecular dynamics simulations. LAMMPS is an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator, a classical molecular dynamics code with a focus on materials modeling. It runs on all kinds of platforms, including CPUs and GPUs. The CPU version takes advantage of AVX512 and is built against the oneAPI libraries.
Unfortunately these results are a little out of context, as we don't currently have competing hardware or even previous-generation Xeons at our disposal. What we can tell you, however, is that these results are within a percentage point or so of what Intel showed us at its demo at Innovation 2022.
NAMD is another parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. It can scale up to and beyond 500,000 cores, so the 120 cores we have here should be flexed pretty hard. Just like LAMMPS, NAMD also uses AVX512, courtesy of oneAPI. Rather than a chart, we get a single result: 3.25 nanoseconds of simulated time per day. It seems that scientists would want a number of these systems working together, and that puts into context why the application can scale across 500,000 CPU cores or more. It's just an awful lot of math.
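To put 3.25 ns/day in perspective, a quick back-of-the-envelope calculation shows why labs chain many of these machines together. The one-microsecond target below is our own illustrative choice, not a figure from Intel:

```python
ns_per_day = 3.25  # our measured NAMD result on this 120-core system
target_ns = 1_000  # one microsecond of simulated time (illustrative target)

days = target_ns / ns_per_day
print(f"~{days:.0f} days on a single server")  # ~308 days
```

Nearly a year of wall-clock time for a single microsecond of biomolecular motion makes the appeal of half-million-core scaling fairly obvious.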
Last up is the ResNet50 Image Recognition benchmark. This uses version 1.5 of the ResNet50 machine learning model, which has some tweaks over the original to improve recognition accuracy slightly and alleviate a bottleneck in downsampling. So in addition to being more accurate, it should also be somewhat faster than the original. All of that is important to note when running a benchmark, because these numbers are not directly comparable to version 1.
The three results above use 32-bit floating point math and 8-bit integer math, the latter of which is faster to execute in parallel yet has enough precision for AI tasks that it doesn't significantly alter the results. Simply making that switch and using VNNI, Intel's Vector Neural Network Instructions, is enough to nearly quadruple performance in this benchmark. But then when the test moves on to AMX, Intel's Advanced Matrix Extensions, performance more than doubles again, providing a 9x uplift overall.
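The two steps compound multiplicatively, which is where the 9x figure comes from. Treating the per-step ratios as approximate chart readings rather than exact measurements:

```python
fp32 = 1.0         # normalized FP32 baseline throughput
vnni = 4.0 * fp32  # INT8 + VNNI: roughly quadruple the baseline
amx = 9.0 * fp32   # INT8 + AMX: the quoted 9x overall uplift

# The AMX step on its own, relative to the VNNI result:
print(f"AMX over VNNI: {amx / vnni:.2f}x")  # 2.25x, i.e. "more than doubled"
```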
Testing Intel Sapphire Rapids Claims: Compression And Database Workloads
Now that we've covered the previously-unpublished benchmarks, let's move on to confirming Intel's own benchmark numbers. As mentioned previously, we didn't have physical access to an AMD Milan server like the one Intel used for its Sapphire Rapids comparisons, but we can at least validate whether Intel's claims hold up. There's plenty of publicly-available information around the web relating to performance on other platforms, but we're just not comfortable comparing our own controlled work to others' because there are so many variables. That means Intel's claims have to stand or fall on their own merit for now.
There are two categories of these tests: those which need a second server as a client, and those that don't. We're going to focus on the latter, since we needed to use Intel's remotely-accessible environment for a client.
The first of those tests is file compression, and that's where the QAT in QATzip comes in: Intel's Quick Assist Technology, the family name for several Xeon accelerator technologies, uses two methods to speed up the process. The first is Intel's ISA-L, or Intelligent Storage Acceleration Library. Compared to the standard ZLIB compression library's Gzip functionality, its performance promises to be more than 15 times faster. The second is Intel's dedicated acceleration hardware for QATzip, which the company says should push performance even higher.
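For context, the pure-software path that ISA-L and the QAT hardware are measured against is ordinary zlib-style gzip compression, which can be sketched with Python's standard library. The payload here is a made-up stand-in, not Intel's actual test corpus:

```python
import gzip

# Hypothetical compressible payload standing in for the benchmark corpus.
payload = b"timestamp=1662000000 status=200 path=/index.html\n" * 4096

# Plain software gzip: the baseline that ISA-L speeds up in software
# and that the QAT engines offload entirely from the CPU cores.
compressed = gzip.compress(payload, compresslevel=6)
assert gzip.decompress(compressed) == payload
print(f"{len(payload)} bytes -> {len(compressed)} bytes")
```

The output format (a standard gzip stream) is the same in all three cases; only who does the work, and how fast, changes.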
Recall that in Intel's own tests, the pair of 64-core Milan CPUs in the EPYC system actually won when Intel's own ISA-L library was used, and eight more cores were likely not going to make up the difference. However, employing the QAT acceleration built into our server did two things: not only was performance just short of 40% higher, but the benchmark reported that it wasn't using all 120 cores. The server was actually utilizing just four cores, which we could verify by using the command in a second SSH terminal window. That means that the rest of the CPU cores in our system could get busy doing other things. Looking back at Intel's numbers, we can also see that using the Quick Assist hardware is fast enough to surpass the current-gen AMD EPYC system, as well.
Next is RocksDB, a data indexing storage system. Important (and famous) competitors include Elasticsearch and Amazon's OpenSearch. The idea behind these key/value indexing systems is to make huge datasets searchable with minimal latency. RocksDB got its start indexing Facebook and LinkedIn users, posts, job listings, and so on. It's also used as the storage engine for popular SQL databases like MySQL and NoSQL databases like MongoDB and Redis. These tools are important for keeping the internet quickly searchable. Intel's QAT accelerates data compression and aims to speed up searching and finding records.
The "No Intel IAA" result above uses the Zstandard real-time compression algorithm. That task, an 80/20 workload (meaning 80% of the operations are reads), hovered around 100 microseconds in latency and could handle around 2,500 kOP/s. Using Intel's In-Memory Analytics Accelerator (IAA), which is part of Quick Assist Technology, latency dropped by more than half to 48 microseconds, and throughput climbed to 4,291 kOP/s. That's basically identical to the numbers Intel published, showing that once again QAT is really handy in another common server workload. IAA didn't achieve this with a huge storage footprint, either; the sample dataset was around 43 GB on disk with ZSTD, while the IAA version was only slightly bigger at 44.6 GB. Of course, that data was loaded into the huge 1 TB of RAM in the system to keep latencies as low as possible, but persistence is mandatory unless you want to rebuild the index with each reboot.
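Expressed as ratios, those figures work out as follows. This is just our arithmetic on the numbers above, not additional measurement:

```python
zstd_lat_us, iaa_lat_us = 100, 48  # average latency, microseconds
zstd_kops, iaa_kops = 2500, 4291   # throughput, kOP/s
zstd_gb, iaa_gb = 43.0, 44.6       # on-disk dataset size, GB

print(f"Latency:    {zstd_lat_us / iaa_lat_us:.2f}x lower with IAA")  # ~2.08x
print(f"Throughput: {iaa_kops / zstd_kops:.2f}x higher with IAA")     # ~1.72x
print(f"Footprint:  {(iaa_gb / zstd_gb - 1) * 100:.1f}% larger")      # ~3.7%
```

A roughly 2x latency and 1.7x throughput gain for under 4% extra disk space is an easy trade in most deployments.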
Next up, it's time to run our client/server tests, which for our purposes means a remote SSH session with another Sapphire Rapids server. Then we'll wrap up everything we've seen.