Here at Blackcore, overclocking is our specialty, and we have spent considerable time perfecting our craft. If you’d like to learn more about our overclocking practices, you can read about them here. This article will focus on how we test our overclocks against our stringent quality standards and the tools we trust to help us do that.
Monitoring
When overclocking, many different variables are being changed – but how do we know if the changes we are making are having any impact, let alone the intended impact?
We utilize a select few tools to monitor system sensors and statistics. We’re looking to confirm that metrics such as clock frequencies, various temperatures, and multiple voltages are within our expected parameters for success.
If these metrics don’t match expectations, that may mean we are hitting some sort of throttle, or that conflicting settings need adjustment before we can reach the performance goal we are aiming for.
HWiNFO64 is an incredibly useful tool. It’s Windows-based, which is not ideal given how much of our tooling runs on Linux, but it has the most extensive sensor compatibility and coverage that we’ve found. It will surface details about system components that no other tool we’ve worked with will, and those details tend to be particularly useful when dialling in a complex overclock.
Perf is a profiling tool for Linux. Used as “perf stat [benchmark]”, it reports low-level hardware counter statistics for the run. Other low-level software profiling tools, such as AMD's uProf or Intel® VTune™, can also be useful, but they are harder to configure and better suited to software developers than to hardware characterisation. We use perf predominantly for running benchmarks and checking that the clockspeed (cycles) and IPC (instructions per clock) are within expectations.
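For example, a typical invocation pins the workload to known cores and lets perf summarise the run (./benchmark is a placeholder for whatever workload is under test, and the core numbers are arbitrary):

perf stat taskset -c 2-5 ./benchmark

The two output lines we check first are the cycles counter, which perf annotates with the average clockspeed achieved (e.g. “# 5.400 GHz”), and the instructions counter, which it annotates with IPC (“insn per cycle”).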
Turbostat is a simple Linux tool that can verify key parameters of the CPU – such as clockspeed, C-states, temperature, and package power draw. In the past, a similar tool called i7z was popular, but it has not been updated for a long time, and in our experience it is not as reliable on newer generations of CPU. Whilst not as sensor-rich as HWiNFO64, turbostat is critical for checking that the CPU is running at the correct clockspeed – particularly useful for checking how the clockspeed is being affected by offsets due to AVX code, etc.
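A minimal invocation we might use looks like the below – note that the exact column names (Busy%, Bzy_MHz, PkgTmp, PkgWatt, and so on) vary slightly between turbostat versions, so treat this as a sketch:

# Sample the key CPU parameters once per second (requires root)
turbostat --quiet --interval 1 --show Busy%,Bzy_MHz,CoreTmp,PkgTmp,PkgWatt

The Bzy_MHz column is derived from the APERF/MPERF counters, i.e. the average clockspeed while the core was busy, which makes it a good sanity check against the frequency you think you have configured.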
"When everything looks normal, but performance is tanking.”
It’s important to note that some monitoring tools can provide incorrect results. This can often be caused by hidden hardware protection mechanisms or outdated software picking up the wrong values.
Clock Stretching
- Protection mechanism against low voltage.
- Clock slows, but reported frequency remains high.
- Monitoring can miss it: the method may read the requested ratio rather than real cycles, or the polling interval may be too long.
- This can be caught in HWiNFO64 by checking that the “effective clock” matches expectations; a Linux cross-check using perf is sketched below.
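On Linux, perf can provide a similar cross-check, since the frequency it derives from the cycles counter reflects cycles actually completed rather than the requested ratio. A minimal sketch, assuming the core under test (core 2 here is arbitrary) is already running a stressor:

# Measure core 2 for 10 seconds while it is under load
perf stat -C 2 -- sleep 10

If the GHz figure printed next to the cycles counter sits noticeably below the clockspeed your monitoring reports, the clock may be being stretched rather than genuinely held at the set ratio.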
A real-world example – ‘Phantom Throttling’
- Coined internally during Intel X299 development.
- Protection mechanism triggered when VCCIN was “too low”.
- Circumventing this check allowed otherwise unstable frequencies.
- Delicate balance between various system voltages & adequate cooling.
Stress Testing
Whilst dialling in an overclock, it is key to constantly stress test the system to check for stability as various settings are changed.
The primary tool for us is Prime95 because it is reliable and consistent. It pushes power draw and temperature further than any other tool we use, and works by calculating FFTs (Fast Fourier Transforms). Prime95 can be tuned to run with various data-set sizes, which essentially means tweaking where data is stored and, therefore, which part of the core, cache, and RAM hierarchy is stressed.
This comes with some important considerations, however:
Firstly, we find Prime95 best suited to SSE instruction sets, as the AVX code paths that Prime95 utilises are incredibly “heavy”. The work the CPU must do to execute these instructions is much harder than for typical (SSE) instructions, which means more heat and more power. Because of this, most CPUs will down-clock themselves to perform AVX instructions. Additionally, the way Prime95 uses AVX instructions is not very representative of a real-world scenario, so adjusting the overclock just to pass the AVX tests could leave performance on the table.
Prime95 is available for Windows and Linux (as the binary package ‘mprime’).
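As an illustration, the torture test can be steered from the configuration files mprime reads at startup. This is a sketch only – the ‘#’ lines are annotations for this article, not valid syntax in the files themselves, and option support varies between Prime95 versions:

# prime.txt – constrain the torture test to small, in-cache FFT sizes
MinTortureFFT=4
MaxTortureFFT=32
TortureMem=0
TortureTime=3

# local.txt – force the SSE code paths by hiding AVX capability
CpuSupportsAVX=0
CpuSupportsAVX2=0
CpuSupportsFMA3=0

# then run the torture test from the shell
mprime -t

With TortureMem=0 the test runs in-place, keeping the data set within the cache hierarchy; raising the FFT sizes and memory allocation pushes the stress out towards RAM.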
Another useful tool is stress-ng on Linux, which is flexible and can be tuned more granularly to test more parts of the system, with detailed reporting and configuration to match. It is typically less stressful on the system than Prime95, but could be seen as more representative of real-world loads. Run stress-ng with the “--metrics-brief” flag to get easy-to-read performance statistics, or go a step further and wrap the process with perf for a more detailed overview of the run, as shown below:
[root@3100-RZ ~]# perf stat taskset -c 1-11 stress-ng -c 10 -t 10s --metrics-brief
stress-ng: info:  [541213] setting to a 10-second run per stressor
stress-ng: info:  [541213] dispatching hogs: 10 cpu
stress-ng: metrc: [541213] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [541213]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [541213] cpu              271482     10.00     94.90      0.01     27146.31        2860.58
stress-ng: info:  [541213] skipped: 0
stress-ng: info:  [541213] passed: 10: cpu (10)
stress-ng: info:  [541213] failed: 0
stress-ng: info:  [541213] metrics untrustworthy: 0
stress-ng: info:  [541213] successful run completed in 10.03 secs

 Performance counter stats for 'taskset -c 1-11 stress-ng -c 10 -t 10s --metrics-brief':

         94,916.93 msec task-clock                #    9.454 CPUs utilized
               571      context-switches          #    6.016 /sec
               351      cpu-migrations            #    3.698 /sec
             5,726      page-faults               #   60.326 /sec
   512,528,977,767      cycles                    #    5.400 GHz
    50,126,007,759      stalled-cycles-frontend   #    9.78% frontend cycles idle
 1,106,891,820,122      instructions              #    2.16  insn per cycle
                                                  #    0.05  stalled cycles per insn
   118,062,220,490      branches                  #    1.244 G/sec
     1,546,384,233      branch-misses             #    1.31% of all branches

      10.039339863 seconds time elapsed

      94.899808000 seconds user
       0.013776000 seconds sys
Benchmarking
We’ve chosen to highlight the following tools as they’re simple to run on any hardware and give a clear view of the aspects of a system's performance most relevant to electronic trading: PCIe, RAM, and cache latency.
Prime95 (mprime) provides a consistent workload; we use it to calculate the CPU's IPC (Instructions Per Clock), which is a general indicator of performance. IPC is instruction-dependent, so it’s important to have the context of which specific benchmark a number is quoted from – otherwise it is easy to steer benchmark results toward a desired figure rather than maintain consistent metrics.
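For example, with mprime pinned to known cores, IPC can be read for just those cores (the core numbers here are illustrative):

# Start the workload pinned to cores 2-5, then sample those cores
taskset -c 2-5 mprime -t &
perf stat -C 2-5 -- sleep 30

The “insn per cycle” figure is the IPC, and we always record it alongside the benchmark and settings it was measured under.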
Intel MLC provides information about a system's cache and RAM performance. At a basic level, it can report cache and RAM latency and bandwidth at various data sizes. With more fine-tuning, we can test very specific setups to mimic more real-world memory usage. Another great use case is measuring performance when moving across cores that sit on different tiles or NUMA nodes.
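A few invocations we find representative – flag names are as of recent MLC releases, so check mlc --help on your version:

./mlc --idle_latency        # unloaded memory latency
./mlc --latency_matrix      # latency from each NUMA node to every other
./mlc --loaded_latency      # latency under increasing bandwidth load
./mlc --idle_latency -b2m   # constrain the buffer size to target a cache level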
pcie-lat is a simple Linux tool for measuring the latency of the PCIe bus. It is as simple as compiling the driver on the target system, binding it to an installed PCIe card, and then running the benchmark. The benchmark sends a signal to the card and waits for the response to be returned. It’s important to note that the implementation of the card being used for testing will affect the measurement: a faster card will process the data and return the response sooner than a slower card. Additionally, different PCIe slots may be faster or slower than others. Sometimes the difference will be non-existent or negligible, coming down solely to the length of the copper trace in the motherboard. At other times, due to hardware architecture factors, a PCIe slot may be bifurcated through a switch or run through a re-timer, which may affect the latency. So, it is again important to have the full context of the test: which card is being used and which PCIe slot it is in.
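The exact steps depend on the pcie-lat version in use, but the general Linux driver-binding flow looks like this (the PCI address 0000:01:00.0 and the vendor/device ID 10ee 7024 are placeholders for the card under test):

# Build and load the measurement driver
make
insmod pcie-lat.ko

# Unbind the target card from its current driver
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

# Bind it to pcie-lat by registering the card's vendor/device ID
echo "10ee 7024" > /sys/bus/pci/drivers/pcie-lat/new_id

# Finally, run the user-space benchmark utility shipped with the tool
# against the bound device (see the tool's README for the exact name)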
C2clat measures the latency between each pair of cores in a matrix fashion. This is useful not only for seeing the performance of an individual core's cache, but also for understanding how a CPU's architecture is connected, and it aids software architecture design when leveraging CPU pinning. This is increasingly important with modern CPUs, which make use of a variety of architectures: Intel’s ring bus vs mesh, the increased use of tiled and chiplet-based designs, and even packages combining “big/little” cores.
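A sketch of how the matrix feeds back into pinning decisions (c2clat is commonly distributed as a single C++ source file; the core numbers and ./trading_app are illustrative):

# Build and run the core-to-core latency matrix
g++ -O3 -pthread -o c2clat c2clat.cpp
./c2clat

# If the matrix shows cores 2 and 3 share a low-latency path, pin the
# two communicating threads of the application there
taskset -c 2,3 ./trading_app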
When running any benchmark (or indeed any other tool), it’s important to understand what the test is doing and what variables may affect its outcome; otherwise, you will have no context for whether the results are meaningful or expected. Being able to repeat a consistent set of tests from server to server and platform to platform is key!
It is key to check that results react accordingly to overclock changes and align with the monitoring output.
Consider…
IPC is instruction-dependent. This means that different benchmarks will give different IPC numbers, so comparison of like-for-like is key.
Frequency is architecture-dependent. This means that we cannot blindly compare performance statistics, such as CPU clock speed, across architectures. 5GHz on an AMD processor is not the same as 5GHz on an Intel processor. In fact, even within a single CPU vendor there may not be consistency in measured results; 5GHz on Raptor Lake is not the same as 5GHz on Sapphire Rapids.
Hyper-threading will affect metrics like IPC & frequency. It often cuts per-thread IPC by roughly 50% while doubling the number of threads; note that it is not an exact 50% reduction, as there are other overheads to consider.
Instruction set matters. Whether the software application code is using SSE or the various AVX levels will affect performance. AVX will often result in a down-clock in frequency, which may be alarming at first, but could still result in faster performance due to the increased throughput of certain AVX instructions.
Best practices from the past may not apply to new architectures. We’ve found that even small architecture differences can result in wildly different behaviour, so it’s always best to start with a clean sheet of paper, while relying on existing experience to determine the direction of overclock development.
It is imperative to use a variety of tools to collect metrics on how an overclock is affecting system performance, and to cross-reference their results against known good benchmarks. Knowledge and experience of the operating parameters, tools, and benchmark results are key to matching those results against expected outcomes, ensuring that the stated overclocking goals are met, reliable, and will hold throughout the lifecycle of the product. If you are overclocking in your own lab, or have an alternate vendor's product in your racks, it's important to understand how the benchmarking is taking place, which tools are being used, and that the system has been adequately stress tested before it moves to production.
If you would like to learn more about Blackcore, please email your account manager. If you are new to Blackcore and would like to learn more, you can email [email protected] or book a call with our CRO Ciaran Kennedy here.