Here at Blackcore, overclocking is our specialty, and we have spent considerable time perfecting our craft. If you’d like to learn more about our overclocking practices, you can read about them here. This article will focus on how we test our overclocks against our stringent quality standards and the tools we trust to help us do that.
Monitoring
When overclocking, many different variables are being changed – but how do we know if the changes we are making are having any impact, let alone the intended impact?
We utilize a select few tools to monitor system sensors and statistics. We’re looking to confirm that metrics such as clock frequencies, various temperatures, and multiple voltages are within our expected parameters for success.
If these metrics don’t match expectations, that may mean we are hitting some sort of throttle, or that conflicting settings need adjustment before we can reach the performance goal we are aiming for.
HWiNFO64 is an incredibly useful tool. It’s Windows-based, which is not ideal given how much of our tooling runs on Linux, but it has the most extensive sensor compatibility and coverage that we’ve found. It will surface details about system components that no other tool we’ve worked with will, and those details tend to be particularly useful when dialling in a complex overclock.
Perf is a profiling tool for Linux. Used as “perf stat [benchmark]”, it reports low-level hardware counter statistics for the run. Other low-level software profiling tools, such as AMD's uProf or Intel® VTune™, can also be useful, but they are harder to configure and better suited to software developers than to hardware characterisation. We use perf predominantly for running benchmarks and checking that the clockspeed (cycles) and IPC (instructions per clock) are within expectations.
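For example, a typical invocation pins the workload to known cores and lets perf summarise the run (./benchmark is a placeholder for whatever workload is under test, and the core numbers are arbitrary):

perf stat taskset -c 2-5 ./benchmark

The two output lines we check first are the cycles counter, which perf annotates with the average clockspeed achieved (e.g. “# 5.400 GHz”), and the instructions counter, which it annotates with IPC (“insn per cycle”).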
Turbostat is a simple Linux tool that can verify key parameters of the CPU – such as clockspeed, C-states, temperature, and package power draw. In the past, a similar tool called i7z was popular, but it has not been updated for a long time, and in our experience it is not as reliable on newer generations of CPU. Whilst not as sensor-rich as HWiNFO64, turbostat is critical for checking that the CPU is running at the correct clockspeed – particularly useful for checking how the clockspeed is being affected by offsets due to AVX code, etc.
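A minimal invocation we might use looks like the below – note that the exact column names (Busy%, Bzy_MHz, PkgTmp, PkgWatt, and so on) vary slightly between turbostat versions, so treat this as a sketch:

# Sample the key CPU parameters once per second (requires root)
turbostat --quiet --interval 1 --show Busy%,Bzy_MHz,CoreTmp,PkgTmp,PkgWatt

The Bzy_MHz column is derived from the APERF/MPERF counters, i.e. the average clockspeed while the core was busy, which makes it a good sanity check against the frequency you think you have configured.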
"When everything looks normal, but performance is tanking.”
It’s important to note that some monitoring tools can provide incorrect results. This can often be caused by hidden hardware protection mechanisms or outdated software picking up the wrong values.
Clock Stretching
- Protection mechanism against low voltage.
- Clock slows, but reported frequency remains high.
- Monitoring can miss it: the method may read the requested ratio rather than real cycles, or the polling interval may be too long.
- This can be caught in HWiNFO64 by checking that the “effective clock” matches expectations; a Linux cross-check using perf is sketched below.
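On Linux, perf can provide a similar cross-check, since the frequency it derives from the cycles counter reflects cycles actually completed rather than the requested ratio. A minimal sketch, assuming the core under test (core 2 here is arbitrary) is already running a stressor:

# Measure core 2 for 10 seconds while it is under load
perf stat -C 2 -- sleep 10

If the GHz figure printed next to the cycles counter sits noticeably below the clockspeed your monitoring reports, the clock may be being stretched rather than genuinely held at the set ratio.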
A real-world example – ‘Phantom Throttling’
- Coined internally during Intel X299 development.
- Protection mechanism triggered when VCCIN was “too low”.
- Circumventing this check allowed otherwise unstable frequencies.
- Delicate balance between various system voltages & adequate cooling.
Stress Testing
Whilst dialling in an overclock, it is key to constantly stress test the system to check for stability as various settings are changed.
The primary tool for us is Prime95 because it is reliable and consistent. It pushes power draw and temperature further than any other tool we use, and works by calculating FFTs (Fast Fourier Transforms). Prime95 can be tuned to run with various data-set sizes, which essentially means tweaking where data is stored and, therefore, which part of the core, cache, and RAM hierarchy is stressed.
This comes with some important considerations, however:
Firstly, we find Prime95 best suited to SSE instruction sets, as the AVX code paths that Prime95 utilises are incredibly “heavy”. The work the CPU must do to execute these instructions is much harder than for typical (SSE) instructions, which means more heat and more power. Because of this, most CPUs will down-clock themselves to perform AVX instructions. Additionally, the way Prime95 uses AVX instructions is not very representative of a real-world scenario, so adjusting the overclock just to pass the AVX tests could leave performance on the table.
Prime95 is available for Windows and Linux (as the binary package ‘mprime’).
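As an illustration, the torture test can be steered from the configuration files mprime reads at startup. This is a sketch only – the ‘#’ lines are annotations for this article, not valid syntax in the files themselves, and option support varies between Prime95 versions:

# prime.txt – constrain the torture test to small, in-cache FFT sizes
MinTortureFFT=4
MaxTortureFFT=32
TortureMem=0
TortureTime=3

# local.txt – force the SSE code paths by hiding AVX capability
CpuSupportsAVX=0
CpuSupportsAVX2=0
CpuSupportsFMA3=0

# then run the torture test from the shell
mprime -t

With TortureMem=0 the test runs in-place, keeping the data set within the cache hierarchy; raising the FFT sizes and memory allocation pushes the stress out towards RAM.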
Another useful tool is stress-ng on Linux, which is flexible and can be tuned more granularly to test more parts of the system, with detailed reporting and configuration to match. It is typically less stressful on the system than Prime95, but could be seen as more representative of real-world loads. Run stress-ng with the “--metrics-brief” flag to get easy-to-read performance statistics, or go a step further and wrap the process with perf for a more detailed overview of the run, as shown below:
[root@3100-RZ ~]# perf stat taskset -c 1-11 stress-ng -c 10 -t 10s --metrics-brief
stress-ng: info:  [541213] setting to a 10-second run per stressor
stress-ng: info:  [541213] dispatching hogs: 10 cpu
stress-ng: metrc: [541213] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [541213]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [541213] cpu              271482     10.00     94.90      0.01     27146.31        2860.58
stress-ng: info:  [541213] skipped: 0
stress-ng: info:  [541213] passed: 10: cpu (10)
stress-ng: info:  [541213] failed: 0
stress-ng: info:  [541213] metrics untrustworthy: 0
stress-ng: info:  [541213] successful run completed in 10.03 secs

 Performance counter stats for 'taskset -c 1-11 stress-ng -c 10 -t 10s --metrics-brief':

         94,916.93 msec task-clock                #    9.454 CPUs utilized
               571      context-switches          #    6.016 /sec
               351      cpu-migrations            #    3.698 /sec
             5,726      page-faults               #   60.326 /sec
   512,528,977,767      cycles                    #    5.400 GHz
    50,126,007,759      stalled-cycles-frontend   #    9.78% frontend cycles idle
 1,106,891,820,122      instructions              #    2.16  insn per cycle
                                                  #    0.05  stalled cycles per insn
   118,062,220,490      branches                  #    1.244 G/sec
     1,546,384,233      branch-misses             #    1.31% of all branches

      10.039339863 seconds time elapsed

      94.899808000 seconds user
       0.013776000 seconds sys
Benchmarking
We’ve chosen to highlight the following tools as they’re simple to run on any hardware and give a clear view of the aspects of a system's performance most relevant to electronic trading: PCIe, RAM, and cache latency.
Prime95 (mprime) provides a consistent workload; we use it to calculate the CPU's IPC (Instructions Per Clock), which is a general indicator of performance. IPC is instruction-dependent, so it’s important to have the context of which specific benchmark a number is quoted from – otherwise it is easy to steer benchmark results toward a desired figure rather than maintain consistent metrics.
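For example, with mprime pinned to known cores, IPC can be read for just those cores (the core numbers here are illustrative):

# Start the workload pinned to cores 2-5, then sample those cores
taskset -c 2-5 mprime -t &
perf stat -C 2-5 -- sleep 30

The “insn per cycle” figure is the IPC, and we always record it alongside the benchmark and settings it was measured under.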
Intel MLC provides information about a system's cache and RAM performance. At a basic level, it can report cache and RAM latency and bandwidth at various data sizes. With more fine-tuning, we can test very specific setups to mimic more real-world memory usage. Another great use case is measuring performance when moving across cores that sit on different tiles or NUMA nodes.
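A few invocations we find representative – flag names are as of recent MLC releases, so check mlc --help on your version:

./mlc --idle_latency        # unloaded memory latency
./mlc --latency_matrix      # latency from each NUMA node to every other
./mlc --loaded_latency      # latency under increasing bandwidth load
./mlc --idle_latency -b2m   # constrain the buffer size to target a cache level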
pcie-lat is a simple Linux tool for measuring the latency of the PCIe bus. It is as simple as compiling the driver on the target system, binding it to an installed PCIe card, and then running the benchmark. The benchmark sends a signal to the card and waits for the response to be returned. It’s important to note that the implementation of the card being used for testing will affect the measurement: a faster card will process the data and return the response sooner than a slower card. Additionally, different PCIe slots may be faster or slower than others. Sometimes the difference will be non-existent or negligible, coming down solely to the length of the copper trace in the motherboard. At other times, due to hardware architecture factors, a PCIe slot may be bifurcated through a switch or run through a re-timer, which may affect the latency. So, it is again important to have the full context of the test: which card is being used and which PCIe slot it is in.
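The exact steps depend on the pcie-lat version in use, but the general Linux driver-binding flow looks like this (the PCI address 0000:01:00.0 and the vendor/device ID 10ee 7024 are placeholders for the card under test):

# Build and load the measurement driver
make
insmod pcie-lat.ko

# Unbind the target card from its current driver
echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

# Bind it to pcie-lat by registering the card's vendor/device ID
echo "10ee 7024" > /sys/bus/pci/drivers/pcie-lat/new_id

# Finally, run the user-space benchmark utility shipped with the tool
# against the bound device (see the tool's README for the exact name)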
C2clat measures the latency between each pair of cores in a matrix fashion. This is useful not only for seeing the performance of an individual core's cache, but also for understanding how a CPU's architecture is connected, and it aids software architecture design when leveraging CPU pinning. This is increasingly important with modern CPUs, which make use of a variety of architectures: Intel’s ring bus vs mesh, the increased use of tiled and chiplet-based designs, and even packages combining “big/little” cores.
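A sketch of how the matrix feeds back into pinning decisions (c2clat is commonly distributed as a single C++ source file; the core numbers and ./trading_app are illustrative):

# Build and run the core-to-core latency matrix
g++ -O3 -pthread -o c2clat c2clat.cpp
./c2clat

# If the matrix shows cores 2 and 3 share a low-latency path, pin the
# two communicating threads of the application there
taskset -c 2,3 ./trading_app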
When running any benchmark (or indeed any other tool), it’s important to understand what the test is doing and what variables may affect its outcome; otherwise, you will have no context for whether the results are meaningful or expected. Being able to repeat a consistent set of tests from server to server and platform to platform is key!
It is key to check that results react accordingly to overclock changes and align with the monitoring output.
Consider…
IPC is instruction-dependent. This means that different benchmarks will give different IPC numbers, so comparison of like-for-like is key.
Frequency is architecture-dependent. This means that we cannot blindly compare performance statistics, such as CPU clock speed, across architectures. 5GHz on an AMD processor is not the same as 5GHz on an Intel processor. In fact, even within a single CPU vendor there may not be consistency in measured results; 5GHz on Raptor Lake is not the same as 5GHz on Sapphire Rapids.
Hyper-threading will affect metrics like IPC & frequency. It often cuts per-thread IPC by roughly 50% while doubling the number of threads; note that it is not an exact 50% reduction, as there are other overheads to consider.
Instruction set matters. Whether the software application code is using SSE or the various AVX levels will affect performance. AVX will often result in a down-clock in frequency, which may be alarming at first, but could still result in faster performance due to the increased throughput of certain AVX instructions.
Best practices from the past may not apply to new architectures. We’ve found that even small architecture differences can result in wildly different behaviour, so it’s always best to start with a clean sheet of paper, while relying on existing experience to determine the direction of overclock development.
It is imperative to use a variety of tools to collect metrics on how an overclock is affecting system performance, and to cross-reference their results against known good benchmarks. Knowledge and experience of the operating parameters, tools, and benchmark results are key to matching those results against expected outcomes, ensuring that the stated overclocking goals are met, reliable, and will hold throughout the lifecycle of the product. If you are overclocking in your own lab, or have an alternate vendor's product in your racks, it's important to understand how the benchmarking is taking place, which tools are being used, and that the system has been adequately stress tested before it moves to production.
If you would like to learn more about Blackcore, please email your account manager. If you are new to Blackcore and would like to learn more, you can email [email protected] or book a call with our CRO Ciaran Kennedy here.