Microsoft Azure Service Profiler

Service Profiler Overview

Service Profiler is a performance analysis tool used by teams at Microsoft that run large-scale cloud services, and it is optimized for troubleshooting issues in production. It makes it easy to collect performance data while your service is handling real production load, including detailed request duration metrics, deep call stacks, and memory snapshots, and it does so in a low-impact way to minimize overhead on your system. Analysis is done in a browser-based tool that aggregates request durations by percentile and lets you dive into detailed request timelines and memory snapshots.

Performance problems in production are particularly hard to investigate because they are often the issues that went undetected in your own testing during development. Some are difficult to reproduce because they only surface when the service is running at scale. Others are intermittent because a particular customer is using your service in a way you didn't anticipate. Service Profiler helps by collecting deep, detailed data while problems are occurring, and it gives you fine-grained control over how and when the profiling agent runs.

Service Profiler Architecture


Monitor Performance with the Service Profiler Agent

The Service Profiler agent is deployed on each server whose performance you want to profile. The agent runs continuously and is highly configurable: you can tell it which events to monitor, what data to collect, where to store the data for later retrieval, and how often it should be actively sampling. These settings let you balance collecting useful data for your investigations against the overhead that is acceptable for your system. The most effective way to lower Service Profiler's overhead is to reduce its sampling rate to a low percentage; the trade-off is that you will wait longer before you have detailed performance data to analyze. Read the full set of available configuration settings.
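The trade-off between sampling rate and waiting time can be estimated with simple arithmetic. The sketch below is illustrative only: the function name, the traffic figures, and the assumption that each request is sampled independently with a fixed probability are hypothetical and are not taken from the agent's actual settings or behavior.

```python
def expected_hours_to_collect(target_samples: int,
                              requests_per_hour: float,
                              sampling_rate: float) -> float:
    """Expected hours until `target_samples` requests have been sampled.

    Assumes every request is sampled independently with probability
    `sampling_rate` (a simplifying assumption, not documented behavior).
    """
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    # Expected samples per hour is traffic multiplied by the sampling rate.
    return target_samples / (requests_per_hour * sampling_rate)


if __name__ == "__main__":
    # Hypothetical service handling 10,000 requests per hour.
    for rate in (0.10, 0.05, 0.01):
        hours = expected_hours_to_collect(20, 10_000, rate)
        print(f"sampling rate {rate:.0%}: about {hours:.2f} h for 20 samples")
```

Under these assumptions, cutting the sampling rate by a factor of ten multiplies the expected wait by ten, which is the cost you pay for the lower overhead.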

Custom Instrumentation

Service Profiler has built-in support for monitoring common application events such as HTTP requests and SQL database calls, and it can capture detailed call stacks, so no special application code is required to get started. You can optionally augment this with your own custom EventSource Activity instrumentation.

Organizing Performance Data with Data Cubes

You can organize where your application's performance data is stored by creating Data Cubes. For example, you can create Data Cubes called 'myapptest' and 'myappprod' to separate data by environment, a Data Cube for each application ('MyApp', 'MyOtherApp'), or both. Data Cubes are managed at the URL '/#/<DataCubeName>/configure/agent-settings', and each instance of the agent is configured to point to a specific Data Cube in its JSON settings file.


View Response Time Percentiles

Once performance data is being collected, you can view performance metrics in a browser by navigating to the URL of a Data Cube: '/summary/<DataCubeName>'.
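If you script links to several Data Cubes, the two URL patterns mentioned in this document can be filled in with a small helper. This is a minimal sketch: the function names are hypothetical, only the path portion is built because the host is not specified here, and percent-encoding the Data Cube name is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote


def summary_path(data_cube: str) -> str:
    """Path of the percentile dashboard for a Data Cube."""
    return f"/summary/{quote(data_cube, safe='')}"


def agent_settings_path(data_cube: str) -> str:
    """Path of the agent settings page for a Data Cube."""
    return f"/#/{quote(data_cube, safe='')}/configure/agent-settings"


if __name__ == "__main__":
    # Data Cube names taken from the examples in this document.
    for cube in ("myapptest", "myappprod"):
        print(summary_path(cube), agent_settings_path(cube))
```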

The dashboard lists the total number of requests that have been monitored for a given timeframe, and an aggregate summary of how long they took. Importantly, the distribution of request durations is displayed using percentile buckets so that infrequent but statistically significant performance anomalies are not masked. This provides an overall picture of how consistently your application is performing. For example, if your users experience good performance most of the time but significantly slower performance some of the time, you need to be aware of it and determine its impact on them.
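To see why percentile buckets reveal what a single average hides, consider the following sketch. The workload is invented for illustration, and the nearest-rank percentile used here is one common definition; it is an assumption and not necessarily the method the dashboard uses.

```python
import statistics


def nearest_rank_percentile(sorted_ms, pct):
    """Nearest-rank percentile of an ascending list; `pct` is an integer 1..100."""
    if not sorted_ms:
        raise ValueError("no durations supplied")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    rank = (pct * len(sorted_ms) + 99) // 100  # integer ceiling of pct% of n
    return sorted_ms[rank - 1]


def summarize(durations_ms):
    """Return the mean, median, and 95th percentile of request durations."""
    ordered = sorted(durations_ms)
    return {
        "mean": statistics.mean(ordered),
        "p50": nearest_rank_percentile(ordered, 50),
        "p95": nearest_rank_percentile(ordered, 95),
    }


if __name__ == "__main__":
    # Hypothetical workload: 90 fast requests and 10 very slow ones.
    sample = [120] * 90 + [8000] * 10
    print(summarize(sample))
```

For this invented workload the mean is 908 ms, a value no request actually experienced, while the median of 120 ms and the 95th percentile of 8,000 ms show both the typical case and the slow tail.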

Service Profiler Dashboard

The 95th percentile example below can be read as:

Out of a total of 57,078 requests that occurred, 2,854 requests took 8,048 ms or more.
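The count in that reading follows directly from the percentile: the requests at or above the 95th percentile are roughly the slowest 5% of the total. A minimal check of the arithmetic, using a hypothetical helper name and assuming the count is simply that fraction of the total rounded to the nearest request:

```python
def requests_at_or_above(total_requests: int, percentile: int) -> int:
    """Approximate number of requests at or above the given percentile."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # The slowest (100 - percentile)% of requests fall in this bucket.
    return round(total_requests * (100 - percentile) / 100)


if __name__ == "__main__":
    # 5% of 57,078 is 2,853.9, which rounds to 2,854.
    print(requests_at_or_above(57_078, 95))
```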

Service Profiler Percentiles

Some common patterns to look for are:

Samples

Out of the potentially thousands of incoming requests that Service Profiler monitors in any given hour, only a small handful are selected to produce detailed call stack and memory snapshots. This approach minimizes overhead, with the practical benefit that typically only a small set of samples needs to be inspected when investigating a problematic request.

Clicking any percentile will display a list of samples that have been collected. Following a link will open the detailed request timeline view for that sample.

Service Profiler Percentile Popover


Detailed Timeline View

Drilling into a sample reveals a detailed timeline view. Service Profiler collects time-based ETW traces, which are useful for investigating CPU wall clock vs. blocked time. Remote calls to SQL and HTTP services are captured, as well as a full call stack that can help you understand where CPU time was spent on local computational work. Although not required, any custom EventSource Activity instrumentation in your code is also displayed as activities with duration metrics, provided the event names follow the EventName_Start and EventName_Stop convention.
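The naming convention matters because a start event and a stop event can only be paired into one activity if they share the same prefix. The sketch below illustrates that pairing on plain (name, timestamp) tuples. It is not Service Profiler's implementation: the function, the event names, and the timestamps are hypothetical, and nested activities of the same name are matched innermost-first as a simplifying assumption.

```python
START_SUFFIX = "_Start"
STOP_SUFFIX = "_Stop"


def pair_activities(events):
    """Pair EventName_Start / EventName_Stop events into activities.

    `events` is an iterable of (event_name, timestamp_ms) in time order.
    Returns a list of (activity_name, start_ms, duration_ms) tuples in the
    order the activities completed. Unmatched events are ignored.
    """
    open_starts = {}  # activity name -> stack of start timestamps
    activities = []
    for name, timestamp_ms in events:
        if name.endswith(START_SUFFIX):
            activity = name[: -len(START_SUFFIX)]
            open_starts.setdefault(activity, []).append(timestamp_ms)
        elif name.endswith(STOP_SUFFIX):
            activity = name[: -len(STOP_SUFFIX)]
            stack = open_starts.get(activity)
            if stack:  # skip a stop that has no matching start
                start_ms = stack.pop()
                activities.append((activity, start_ms, timestamp_ms - start_ms))
    return activities


if __name__ == "__main__":
    # Hypothetical trace of one request with a nested activity.
    trace = [
        ("LoadOrder_Start", 0),
        ("QueryInventory_Start", 5),
        ("QueryInventory_Stop", 45),
        ("LoadOrder_Stop", 60),
    ]
    for activity, start_ms, duration_ms in pair_activities(trace):
        print(f"{activity}: started at {start_ms} ms, took {duration_ms} ms")
```

An event named, say, LoadOrderBegin would fall through both branches and never produce a duration, which is why events that do not follow the convention cannot be shown as activities.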

Detailed Timeline View


Next Steps