Demystifying System Latency
Dissecting the relationship between hardware, software, and performance delays.

January 1, 2025
Guide

With the vast range of computer systems for casual, enthusiast, and enterprise clients, there's a common question: "Is the system fast enough - could it be faster?"
Interestingly enough, not many components make a system 'fast', from the hardware level to the software level.
Many factors keep your system from reaching its maximum potential.
Let's talk about them.
Understanding Latency
Latency - The delay between an instruction to transfer data and the data being transferred
Sounds like latency is just speed? Not quite - the two are closely related, but they are measured differently and used in different contexts.
Here is how each is used colloquially in the technology scope (this can vary based on the context):
Timer Start → Action → Action Finished → Stop Timer
Latency (Time elapsed to conduct an action) Ex: [ms, μs]
Timer Start → Action(s) → Stop Timer
Speed (# of action(s) over a specified unit of time elapsed) Ex: [Gb/s, MHz]
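The two timer patterns above can be sketched with Python's standard-library timer. The "action" here is an arbitrary stand-in for any unit of work, and the numbers will vary per machine - this is an illustration of the measurement pattern, not a benchmark.

```python
# Sketch: latency (time for ONE action) vs speed (actions per unit time).
import time

def action():
    sum(range(1000))   # stand-in for any unit of work

# Latency: Timer Start -> Action -> Stop Timer
t0 = time.perf_counter()
action()
latency_s = time.perf_counter() - t0   # e.g. microseconds-scale

# Speed: Timer Start -> Action(s) -> Stop Timer, divide by elapsed time
n = 10_000
t0 = time.perf_counter()
for _ in range(n):
    action()
elapsed = time.perf_counter() - t0
speed_ops_per_s = n / elapsed          # actions per second
```

The same pattern generalizes: ms/μs figures are latencies, while Gb/s or MHz figures are rates over elapsed time.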
Now that you understand both concepts, it is equally important to know how each applies from kernel → usermode.
Furthermore, it is necessary to know how hardware affects both of these concepts.
Windows Kernel
The Windows Kernel (NT Kernel) is the core component of the Windows OS; it is the foundation of crucial services such as hardware abstraction, memory management, input/output handling, and security measures.
The NT Kernel serves as the bridge between user-level applications and system hardware, by managing hardware resources and providing necessities to user-mode applications.
Understanding Interrupts
Interrupts (IRQs):
The signal sent to the processor to yield execution of the current task to the interrupt handler in order to address a time-sensitive event.
These signals are called IRQ signals; they originate from hardware or software requests.
Interrupt Handler (ISR):
The interrupt service routine handles the interrupt request.
The ISR is designed to handle IRQs as quickly as possible to resume yielded system operations.
If the request is too long or complex, the ISR defers the additional work to a DPC.
Deferred Procedure Call (DPC):
A mechanism designed to complete tasks passed from the ISR, or tasks that need to be executed when system resources are not occupied by higher-priority work.
The types of DPCs include DpcForIsr & CustomDPC to handle the process of IRQ I/O execution.
DpcForIsr: Directly handles calls triggered from the ISR
CustomDPC: A user-created DPC designed to handle non-critical tasks; these tasks are given higher prioritization than normal threads & don't need to be signaled from the ISR.
DPCs help avoid nonessential delays by executing their tasks at a lower priority (IRQL) than ISRs. This keeps the system responsive, because less critical work is deferred (to DPCs) until the system is under less load.
DPCs are more often triggered by ISRs than by other tasks, which keeps the IRQs handled by the ISR at maximum priority.
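The ISR → DPC hand-off can be sketched as a user-mode analogy: a "handler" that only queues work and returns immediately, plus a worker that drains the deferred work later. This is purely an illustration of the pattern - not kernel code - and every name in it is hypothetical.

```python
# Sketch (analogy only): an "ISR" that does minimal work and defers the
# heavy processing to a "DPC" worker, mirroring the ISR -> DPC split.
import queue
import threading

dpc_queue = queue.Queue()
completed = []

def isr(device_data):
    """Acknowledge the interrupt quickly; defer heavy work (like DpcForIsr)."""
    dpc_queue.put(device_data)       # constant-time: just enqueue the work item

def dpc_worker():
    """Drain deferred work when the 'system' gets around to it."""
    while True:
        item = dpc_queue.get()
        if item is None:             # sentinel: no more interrupts
            break
        completed.append(item * 2)   # stand-in for the expensive processing
        dpc_queue.task_done()

worker = threading.Thread(target=dpc_worker)
worker.start()
for n in range(5):                   # five simulated interrupts arrive
    isr(n)
dpc_queue.put(None)
worker.join()
```

The key property mirrored here is that `isr` never blocks on the expensive work, so new "interrupts" are never delayed by old ones.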
Monitor DPC & ISR routines with tracelog
$tracelog = "C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64\tracelog.exe"
$desktop = [Environment]::GetFolderPath("Desktop")

Write-Host "Starting trace session (five seconds)"
& $tracelog -start -f "$desktop\dpcisr_trace-report.etl" -dpcisr -UsePerfCounter -b 64
Start-Sleep -Seconds 5; & $tracelog -stop

tracerpt "$desktop\dpcisr_trace-report.etl" -report "$desktop\dpcisr_trace-report.html" -f HTML
Write-Host "Files saved to $desktop."
Windows Driver Kit is required to use Tracelog; Download it here
Exploring Usermode (Ring 3)
Usermode - the highest level of an operating system with the most restricted privilege access & process limitations.
Within usermode there are still factors that affect process execution and system runtime.
Understanding how these operations work allows you to prioritize essential processes and system routines, and helps improve system responsiveness & performance!
Understanding Usermode Operations
System calls (syscall)
A system call, also referred to as a 'syscall', is the mechanism by which a usermode program interacts with the kernel. Syscalls occur when an application requests an event, signal, or information from the kernel.
Remember interrupts? When a syscall executes, a transition into the kernel occurs, and the request is then scheduled and executed.
Unnecessary & overly frequent syscalls hurt performance: scheduling pressure & system slowdown result from the large amount of overhead work to process. You cannot modify external applications to make them issue fewer syscalls, so what can you do?
Reduce the number of applications running on a system!
The best way to limit the amount of interrupts for a program is to prevent its execution!
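A related way to see why syscall frequency matters is batching. The sketch below (a hypothetical Python example) writes the same data byte-by-byte versus in a single call; each `os.write` on an unbuffered descriptor corresponds to roughly one kernel transition, so the two approaches differ by orders of magnitude in overhead while producing identical files.

```python
# Sketch (hypothetical illustration): batching I/O reduces kernel transitions.
import os
import tempfile

payload = b"x" * 4096

# Byte-at-a-time: roughly one write syscall per byte (~4096 transitions).
fd, path_many = tempfile.mkstemp()
many_calls = 0
for i in range(len(payload)):
    os.write(fd, payload[i:i + 1])
    many_calls += 1
os.close(fd)

# One batched write: a single syscall for the whole payload.
fd, path_one = tempfile.mkstemp()
os.write(fd, payload)
one_call = 1
os.close(fd)

# Identical file contents; vastly different kernel-transition counts.
with open(path_many, "rb") as a, open(path_one, "rb") as b:
    identical = a.read() == b.read()
os.unlink(path_many)
os.unlink(path_one)
```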
p.s: don’t have an irrational fear of syscalls, most applications involve hundreds & thousands of them
Thread management
System thread is a thread that the kernel spawns automatically during system initialization
Thread - the smallest sequence of instructions that can be executed independently, typically without interrupting other execution
A context switch is the action of saving the state of a process or thread so it can resume later, then restoring the state of another process or thread to allow its execution
Threads are essential for achieving parallelism and distributing work efficiently across a system or application. However, poorly managed threads directly impact performance through resource contention & delays from overly frequent context switching.
To avoid poor thread usage within an application it’s important to avoid unnecessary thread creation, utilize thread priorities, establish proper synchronization between or across threads, and set thread affinities. The combination of these practices will help achieve optimal performance & prevent resource contention.
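Most of those practices can be sketched with Python's standard library: a fixed-size worker pool (avoiding unnecessary thread creation) with a lock protecting the shared result (proper synchronization), instead of spawning one thread per task. A minimal sketch - the workload and pool size are arbitrary, and thread priority/affinity control is platform-specific and omitted here.

```python
# Sketch: fixed-size worker pool + lock-protected shared state, instead
# of one thread per task (which would invite contention and switching).
import threading
import queue

tasks = queue.Queue()
results = []
results_lock = threading.Lock()        # synchronization across threads

def worker():
    while True:
        item = tasks.get()
        if item is None:               # sentinel: shut this worker down
            return
        value = item * item            # the unit of work
        with results_lock:             # avoid racing on the shared list
            results.append(value)

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()
for n in range(100):                   # 100 tasks share only 4 threads
    tasks.put(n)
for _ in pool:
    tasks.put(None)                    # one sentinel per worker
for t in pool:
    t.join()
```

Four long-lived threads service all one hundred tasks, so thread-creation cost is paid once and context switching stays bounded by the pool size.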
Can you set affinities or priorities for external applications? Of course!
This will not be persistent & requires reapplication after the program exits
[Setting Affinity]
Open Task Manager → Details → Right-Click Process → Set Affinity
Choose which CPU cores to be utilized; Done!
[Setting Priority]
Open Task Manager → Details → Right-Click Process → Set Priority
Choose a priority between [Low-Realtime]; Done!
note: the current process of setting priority & affinity is application-wide & does not offer individual thread control
Hardware Influence
Hardware latency - time elapsed for data to travel between components, be processed, or complete a specific operation at the hardware level.
The hardware used in a system plays a vital role in the minimum latency that can be achieved, as well as the maximum performance output. We’ll explore these concepts further and explain how each component has its own effect on the system.
Motherboard
The motherboard is the most overlooked yet important component of a system; it sets the prerequisites for all possible device integration within a system. Its properties need to be evaluated to assess potential conflicts or maximum configuration specifications.
These properties directly (and sometimes indirectly) limit the speed of hardware devices; these limitations include, but are not limited to, the following:
RAM [ Supported frequency | Supported DDR SDRAM ]
CPU [ Supported chipset | VRM Capacity ]
SSD & GPU [ Supported PCIe gen | Number of PCIe slots ]
Network [ Supported Gb/s | Wireless Support ]
MISC [ USB Port Gen. | SATA Support | PCIe lane sharing | RAID Support ]
The UEFI component within the motherboard controls all the non-physical bound property limitations.
CPU
The central processing unit, abbreviated CPU, is the main processor of a given system.
The CPU is responsible for performing the logic, control, arithmetic, and input/output operations specified by its architecture to carry out system tasks. A CPU core is a single processing unit within the CPU that can execute instructions.
Clock Speed - measurement of the number of cycles the CPU executes per second
IPC - measurement of the amount of instructions executed per clock cycle
Clock speed is typically (and unfortunately) used as the sole measure of CPU performance, but it is only one critical measurement to consider. With a multitude of methods to evaluate performance accurately, acknowledge that different tests provide unique insight toward certain goals; let’s view them.
SCP (Single-Core Performance)
Shows the performance of an individual core, highlighting IPC results & evaluating clock speed.
MCP (Multi-Core Performance)
Shows the performance of multiple cores, highlighting everything in SCP plus parallel execution.
SMT (Simultaneous Multithreading)
Shows the performance of the CPU managing multiple threads per core, highlighting resource utilization.
PPW (Performance per Watt)
Shows how a CPU operates under different workloads while evaluating power efficiency.
PPW = (Benchmark / Wattage)
CMP (Cache Memory Performance) | Advanced Topic [here]
The latency of the CPU cache (L1, L2, L3) accessing RAM; highlighting the speed of memory requests.
CMP = (Cache Access Time × Hit Rate) + (Miss Rate × Memory Access Time)
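Plugging hypothetical numbers into the two formulas above gives a feel for the units involved. Every figure below is made up purely for illustration.

```python
# Worked examples for the PPW and CMP formulas (all numbers hypothetical).

# Performance per Watt: benchmark score divided by power draw.
ppw = 25_000 / 125                     # score of 25,000 at 125 W
# -> 200.0 points per watt

# Cache memory performance: weighted average of hit and miss costs.
hit_time_ns = 2.0                      # time to serve a request from cache
hit_rate = 0.95                        # 95% of accesses hit the cache
miss_rate = 1 - hit_rate               # the remaining 5% go to RAM
memory_access_ns = 100.0               # cost of a trip to main memory
cmp_ns = (hit_time_ns * hit_rate) + (miss_rate * memory_access_ns)
# -> roughly 6.9 ns average access time
```

Note how the small miss rate dominates: 5% of accesses contribute about 5 ns of the ~6.9 ns average, which is why cache behavior matters so much for latency.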
Will CPU overclocking always reduce latency?
Absolutely not! Remember, latency is not a single direct measurement & can be introduced by the factors discussed here and many others. Increasing clock speed will not always increase IPC, improve benchmark scores, or reduce system latency.
RAM
Random Access Memory, abbreviated as RAM, is the hardware component the system utilizes for temporarily allocating, referencing, and moving memory.
RAM allows the CPU to quickly access memory for tasks such as executing applications, managing system operations & active processes, supporting multitasking, and more.
RAM Speed (memory bandwidth) - the rate data can be read from or written to the RAM
RAM Latency - the duration the CPU takes to retrieve data from the RAM
RAM Timings - the measurement of the latencies for the basic operations for the RAM [CL,tRCD,tRP,tRAS]
Overall, higher speed & lower timings are better, but like everything else, each timing tells a story.
Let’s view them.
CAS Latency (CL):
The number of clock cycles required to access data from a memory cell after a read command.
tRCD (Row Column Delay):
The latency between accessing a row in memory & starting to read/write columns within it.
tRP (Row Precharge Time):
The duration to close a row in memory before opening a new one.
tRAS (Row Active Time):
The minimum number of cycles a row must stay open to effectively write data.
High RAM speed & lower timings directly enhance performance by improving system responsiveness, reducing latency, and improving multitasking!
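One way to see how speed and timings interact: CAS latency in nanoseconds is roughly CL × 2000 / transfer rate (MT/s), since DDR memory transfers twice per I/O clock. A small sketch with hypothetical kits shows why neither number alone tells the story.

```python
# Sketch: converting CAS latency from clock cycles to nanoseconds.
# true latency (ns) = CL * 2000 / transfer rate (MT/s) for DDR memory,
# because the I/O clock runs at half the transfer rate. Kits are hypothetical.
def cas_latency_ns(cl: int, transfer_rate_mts: int) -> float:
    return cl * 2000 / transfer_rate_mts

ddr4_3200_cl16 = cas_latency_ns(16, 3200)   # 10.0 ns
ddr5_6000_cl30 = cas_latency_ns(30, 6000)   # 10.0 ns
```

Both example kits land at the same true CAS latency: the faster kit needs almost double the cycles, but each cycle is almost half as long - while its higher transfer rate still delivers far more bandwidth.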
Conclusion
Latency exists within every hardware & software component of every system. Evaluating latency, even inside controlled environments, can be a difficult and time-consuming process, which makes diagnosing issues daunting.
A well optimized system allows for efficient multitasking, bottleneck prevention, minimal latency output, and displays superior performance under workload.
Remember, even though latency can arise from seemingly random factors, understanding the problem brings you 50% closer to solving it.
Latency isn’t an obstacle, but an opportunity to innovate - Team Artemis