Event Tracing for Windows (ETW) has been around for a while now, but many (if not most) developers have never used it. I’ve primarily used it for performance tracing, but it’s flexible enough to log regular application events as well.
ETW is implemented in the kernel and can easily handle 100,000 events/second. Since many kernel components are instrumented with ETW events, an event trace can provide an incredibly detailed analysis of system activity. This also means that you can analyze your own application to understand its performance characteristics with respect to CPU usage, disk/network I/O, context switches, garbage collections, etc.
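To get a feel for how quickly a trace grows at that rate, here is a rough back-of-the-envelope sketch in Python. The 100-byte average event size is purely an assumption for illustration (real events vary widely, and events with call stacks are much larger), and `estimate_trace_bytes` is a hypothetical helper, not part of any ETW tool.

```python
def estimate_trace_bytes(events_per_second: float,
                         avg_event_bytes: float,
                         seconds: float) -> float:
    """Rough trace volume: event rate x average event size x duration."""
    return events_per_second * avg_event_bytes * seconds


if __name__ == "__main__":
    # Assumed numbers: 100,000 events/second, ~100 bytes per event, 60 seconds.
    size = estimate_trace_bytes(100_000, 100, 60)
    print(f"{size / 1e6:.0f} MB")  # prints "600 MB"
```

Even under these modest assumptions, a minute of tracing lands in the hundreds of megabytes, which is why short, targeted traces are easier to work with.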
Unfortunately, as a kernel-level feature, ETW has a relatively steep learning curve, and the existing tools are not the most user-friendly. For typical C++/C# developers, I’d recommend getting started collecting traces with PerfView. It comes with a fair amount of documentation, and there’s a series of video tutorials covering the basics.
A few tips for collecting and analyzing traces:
- Try to keep performance traces as short as possible. A 5 GB trace is unwieldy when you only need to root-cause a 3-second spike in CPU usage.
- If you know your scenario is CPU bound, skip collecting context switch events, which occur very frequently and inflate the trace.
- If CPU usage is low but you still have performance problems, look at context switches to understand the reason for blocking.
- Consuming ETW events usually requires administrative privileges.
- A trace can be captured on one machine and analyzed on a different one (useful for collecting traces on slower/older hardware and then diagnosing the problem on something more current).
- When opening a trace and viewing a lot of data, use the largest monitor you can find.
- Disk read/write events only occur when a request is satisfied by the disk. If a request was answered from the OS cache, it will only trigger File IO events.
- Callstacks may be reported as broken if they are more than 192 frames deep. This is OK, but it means you need to perform a bottom-up analysis instead of a top-down analysis.
- ETW event delivery is best-effort, not guaranteed. A few lost events are OK, but if a high percentage of events are lost in your trace, be careful about drawing conclusions. You’re probably logging events faster than the disk can write them. Try logging to a separate physical disk or a network share, or use an in-memory circular buffer.
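
The last tip mentions lost events and circular buffers, so here is a small Python sketch of both ideas. It is only an illustration of the concepts, not how ETW implements its buffers: `lost_fraction` and `CircularEventBuffer` are hypothetical names, and the event counts would come from whatever your trace tool reports.

```python
from collections import deque


def lost_fraction(events_lost: int, events_recorded: int) -> float:
    """Fraction of all generated events that never made it into the trace."""
    total = events_lost + events_recorded
    return events_lost / total if total else 0.0


class CircularEventBuffer:
    """Fixed-capacity buffer that keeps only the most recent events."""

    def __init__(self, capacity: int) -> None:
        # deque with maxlen silently discards the oldest item when full.
        self._events = deque(maxlen=capacity)
        self.overwritten = 0  # how many old events were pushed out

    def append(self, event) -> None:
        if len(self._events) == self._events.maxlen:
            self.overwritten += 1
        self._events.append(event)

    def snapshot(self) -> list:
        """Return the retained events, oldest first."""
        return list(self._events)


if __name__ == "__main__":
    buf = CircularEventBuffer(capacity=3)
    for i in range(5):
        buf.append(i)
    print(buf.snapshot())    # [2, 3, 4]
    print(buf.overwritten)   # 2
    print(f"{lost_fraction(50, 950):.1%}")  # 5.0%
```

The trade-off with a circular buffer is that nothing has to hit the disk while you wait for the problem to reproduce, but you only ever see the most recent window of activity, so stop the trace soon after the event of interest.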
Next time you have a performance problem, don’t guess what the problem might be. Collect an ETW trace and discover exactly what is occurring.