何志文 2020-05-16
转自:https://engineering.shopify.com/blogs/engineering/how-fix-slow-code-ruby
At Shopify, we believe in highly aligned, loosely coupled teams to help us move fast. Since we have many teams working independently on a large monolithic Rails application, inefficiencies in code are sometimes inadvertently added to our codebase. Over time, these problems can add up to serious performance regressions.
By the time such performance regressions are noticeable, it might already be too late to track offending commits down. This can be exceedingly challenging on codebases with thousands of changes being committed each day. How do we effectively find out why our application is slow? Even if we have a fix for the slow code, how can we prove that our new code is faster?
It all starts with profiling and benchmarking. Last year, we wrote about writing fast code in Ruby on Rails. Knowing how to write fast code is useful, but insufficient without knowing how to fix slow code. Let’s talk about the approaches that we can use to find slow code, fix it, and prove that our new solution is faster. We’ll also explore some case studies that feature real world examples on using profiling and benchmarking.
Before we dive into fixing unperformant code, we need to find it first. Identifying code that causes performance bottlenecks can be challenging in a large codebase. Profiling helps us to do so easily.
Profiling is a type of program analysis that collects metrics about the program at runtime, such as the frequency and duration of method calls. It’s carried out using a tool known as a profiler, and a profiler’s output can be visualized in various ways. For example, flat profiles, call graphs, and flamegraphs.
Some issues are challenging to detect by just looking at the code (static analysis, code reviews, etc.). One of the main goals of profiling is observability. By knowing what is going on under the hood during runtime, we gain a better understanding of what the program is doing and reason about why an application is slow. Profiling helps us to narrow down the scope of a performance bottleneck to a particular area.
Before we figure out what to profile, we need to first figure out what we want to know: do we want to measure elapsed time for a specific code block, or do we want to measure object allocations in that code block? In terms of granularity, do we need elapsed time for every single method call in that code block, or do we just need the aggregated value? Elapsed time here can be further broken down into CPU time or wall time.
For measuring elapsed time, a simple solution is to measure the start time and the end time of a particular code block, and report the difference. If we need a higher granularity, we do this for every single method. To do this, we use the TracePoint API in Ruby to hook into every single method call made by Ruby. Similarly, for object allocations, we use the ObjectSpace module to trace object allocations, or even dump the Ruby heap to observe its contents.
However, instead of building custom profiling solutions, we can use one of the available profilers out there, and each has its own advantages and disadvantages. Here are a few options:
1. rbspy
rbspy
samples stack frames from a Ruby process over time. The main advantage is that it can be used as a standalone program without needing any instrumentation code.
Once we know the Ruby Process Identifier (PID) that we want to profile, we start the profiling session like this:
rbspy record —pid $PID
2. stackprof
Like rbspy
, stackprof
samples stack frames over time, but from a block of instrumented Ruby code. Stackprof is used as a profiling solution for custom code blocks:
profile = StackProf.run(mode: :cpu) do<br /> # Code to profile<br />end
The rack-mini-profiler
gem is a fully-featured profiling solution for Rack-based applications. Unlike the other profilers described in this section, it includes a memory profiler in addition to call-stack sampling. The memory profiler collects data such as Garbage Collection (GC) statistics, number of allocations, etc. Under the hood, it uses the stackprof
and memory_profiler
gems.
4. app_profiler
app_profiler
is a lightweight alternative to rack-mini-profiler
. It contains a Rack-only middleware that supports call-stack profiling for web requests. In addition to that, block level profiling is also available to any Ruby application. These profiles can be stored in a configurable storage backend such as Google Cloud Storage, and can be visualized through a configurable viewer such as Speedscope, a browser-based flamegraph viewer.
At Shopify, we collect performance profiles in our production environments. Rack Mini Profiler is a great gem, but it comes with a lot of extra features such as database and memory profiling, and it seemed too heavy for our use case. As a result, we built App Profiler that similarly uses Stackprof under the hood. Currently, this gem is used to support our on-demand remote profiling infrastructure for production requests.
An example of a performance problem that was identified in production was related to unnecessary GC cycles. Last year, we noticed that a cart item with a very large quantity used a ridiculous amount of CPU time and resulted in slow requests. It turns out, the issue was related to Ruby allocating too many objects, triggering the GC multiple times.
The figure below illustrates a section of the flamegraph for a similar slow request, and the section corresponds to approximately 500ms of CPU time.
A section of the flamegraph for a similar slow request
The highlighted chunks correspond to the GC operations, and they interleave with the regular operations. From this section, we see that GC itself consumed about 35% of CPU time, which is a lot! We inferred that we were allocating too many Ruby objects. Without profiling, it’s difficult to identify these kinds of issues quickly.
Now that we know how to identify performance problems, how do we fix them? While the right solution is largely context sensitive, validating the fix isn’t. Benchmarking helps us prove performance differences in two or more different code paths.
Benchmarking is a way of measuring the performance of code. Often, it’s used to compare two or more similar code paths to see which code path is the fastest. Here’s what a simple ruby benchmark looks like:
# frozen_string_literal: true | |
require "benchmark" | |
def some_method | |
sleep(1) # pause execution for 1 second | |
end | |
seconds = Benchmark.realtime { some_method } | |
puts "The method took #{seconds.truncate(2)} seconds." |
This code snippet is benchmarking at its simplest. We’re measuring how long a method takes to run in seconds. We could extend the example to measure a series of methods, a complex math equation, or anything else that fits into a block. This kind of instrumentation is useful because it can unveil regression or improvement in speed over time.
While wall time is a pretty reliable measurement of “performance”, there’s other methods one can measure code by besides realtime
, the Ruby standard library’s Benchmark
module includes bm
and bmbm
.
The <a target="_blank" href="https://www.ancii.com/link/v1/GRiCsrE-wFG89Nq6imY-uskfXWHCR7XaD2hHq8PrpRrTTdM5yCPTs1Z0Yng8HatlgLepaoqEHOmH1QqmYyP2xzOXCOkQ_2tEccb2Wls7noO1uGdhv9sW-XWyCtdxVoTD/" rel="nofollow" title="bm">bm</a>
method shows a more detailed breakdown of timing measurements. Let’s take a look at a script with some output:
# frozen_string_literal: true | |
require "benchmark" | |
def some_method | |
sleep(1) | |
end | |
Benchmark.bm do |x| | |
x.report("method") { some_method } | |
end |
user system total real | |
method 0.000040 0.000030 0.000070 ( 1.002846) |
User, system, and total are all different measurements of CPU time. User refers to time spent working in user space. Similarly, system denotes time spent working in kernel space. Total is the sum of CPU timings, and real is the same wall time measurement we saw from Benchmark.realtime
.
What about <a target="_blank" href="https://www.ancii.com/link/v1/GRiCsrE-wFG89Nq6imY-uskfXWHCR7XaD2hHq8PrpRrTTdM5yCPTs1Z0Yng8HatlgLepaoqEHOmH1QqmYyP2xzOXCOkQ_2tEccb2Wls7noNkfB3jOgtohI7npCL1Qg0P/" rel="nofollow" title="bmbm">bmbm</a>
? Well, it is exactly the same as bm
with one unique difference. Here’s what the output looks like:
Rehearsal ------------------------------------------ | |
method 0.000047 0.000035 0.000082 ( 1.005247) | |
--------------------------------- total: 0.000082sec | |
user system total real | |
method 0.000033 0.000037 0.000070 ( 1.003529) |
The rehearsal, or warmup step is what makes bmbm
useful. It runs benchmark code blocks once before measuring to prime any caching or similar mechanism to produce more stable, reproducible results.
Lastly, let’s talk about the <a target="_blank" href="https://www.ancii.com/link/v1/Dl4zefi9Yghr3VWBBEc_5lg_33trlSQTy6Xs7W9hm3sBiL8Lt6JxNJcM_w314qo9/" rel="nofollow" title="benchmark-ips">benchmark-ips</a>
gem. This is the most common method of benchmarking Ruby code. You’ll see it a lot in the wild, this is what a simple script looks like:
# frozen_string_literal: true | |
require "bundler/inline" | |
gemfile(true) do | |
source "https://rubygems.org" | |
gem "benchmark-ips" | |
end | |
require "benchmark/ips" | |
def some_method | |
sleep(1) | |
end | |
Benchmark.ips do |x| | |
x.report("method") { some_method } | |
end |
Here, we’re benchmarking the same method using familiar syntax with ips
method. Notice the inline bundler and gemfile code. We need this in a scripting context because benchmark-ips
isn’t part of the standard library. In a normal project setup, we add gem entries to the Gemfile
as usual.
The output of this script is as follows:
Warming up -------------------------------------- | |
method 1.000 i/100ms | |
Calculating ------------------------------------- | |
method 0.997 (± 0.0%) i/s - 5.000 in 5.015979s |
Ignoring the bundler output, we see the warmup iteration score per 100 milliseconds ran for the default of 2 seconds, and how many times the code block was able to run in 5 seconds. It’ll become more apparent why benchmark-ips
is so popular later.
So, now we know what benchmarking is and some tools available to us. But why even bother benchmarking at all? It may not be immediately obvious why benchmarking is so valuable.
Benchmarks are used to quantify the performance of one or more blocks of code. This becomes very useful when there are performance questions that need answers. Often, these questions boil down to “which is faster, A or B?”. Let’s look at an example:
# frozen_string_literal: true | |
require "bundler/inline" | |
gemfile(true) do | |
source "https://rubygems.org" | |
gem "benchmark-ips" | |
end | |
require "benchmark/ips" | |
def some_method | |
sleep(1) | |
end | |
def other_method | |
sleep(1.1) | |
end | |
Benchmark.ips do |x| | |
x.report("some method") { some_method } | |
x.report("other method") { other_method } | |
x.compare! | |
end |
In this script, we’re doing most of what we did in the first benchmark-ips
example. Pay attention to the addition of another method, and how it changes the benchmark block. When benchmarking more than one thing at once, simply add another report
block. Additionally, the compare!
method prints a comparison of all reports:
Warming up -------------------------------------- | |
some method 1.000 i/100ms | |
other method 1.000 i/100ms | |
Calculating ------------------------------------- | |
some method 0.997 (± 0.0%) i/s - 5.000 in 5.013651s | |
other method 0.907 (± 0.0%) i/s - 5.000 in 5.513106s | |
Comparison: | |
some method: 1.0 i/s | |
other method: 0.9 i/s - 1.10x slower |
Wow, that’s pretty snazzy! compare!
is able to tell us which benchmark is slower and by how much. Given the amount of thread sleeping we’re doing in our benchmark subject methods, this aligns with our expectations.
Benchmarking can be a means of proving how fast a given code path is. It’s not uncommon for developers to propose a code change that makes a code path faster without any evidence.
Depending on the change, comparison can be challenging. As in the previous example, benchmark-ips
may be used to benchmark individual code paths. Running the same single report benchmark on versions of code easily tests pre and post patch performance.
Now we know what benchmarking is and why it is important. Great! But how do you get started benchmarking in an application? Trivial examples are easy to learn from but aren’t very relatable.
When developing in a framework like Ruby on Rails, it can be difficult to understand how to set up and load framework code for benchmark scripts. Thankfully, one of the latest features of Ruby on Rails can generate benchmarks automatically. Let’s take a look:
# frozen_string_literal: true | |
require_relative "../../config/environment" | |
# Any benchmarking setup goes here... | |
Benchmark.ips do |x| | |
x.report("before") { } | |
x.report("after") { } | |
x.compare! | |
end |
This benchmark can be generated by running bin/rails generate benchmark my_benchmark
, placing a file in script/benchmarks/my_benchmark.rb
. Note the inline gemfile isn’t required because we piggyback off of the Rails app’s Gemfile
. The benchmark generator is slated for release in Rails 6.1.
Now, let’s look at a real world example of a Rails benchmark:
# frozen_string_literal: true | |
require_relative "../../config/environment" | |
class OrderWithCachedPrice < Order | |
def price | |
Rails.cache.fetch("order_#{id}_total") { super } | |
end | |
end | |
order = Order.first_or_create! | |
order.line_items.create!(name: "Rocket Shoes", price: 97.98) | |
order.line_items.create!(name: "Invisible Ink", price: 5.67) | |
order.line_items.create!(name: "Ray Gun", price: 1_235.97) | |
order.line_items.create!(name: "Chunky Bacon", price: 8.99) | |
order_with_cache = OrderWithCachedPrice.first | |
Benchmark.ips do |x| | |
x.report("before") { order.price } | |
x.report("after") { order_with_cache.price } | |
x.compare! | |
end |
In this example, we’re subclassing Order
and caching the calculation it does to find the total price of all line items. While it may seem obvious that this would be a beneficial code change, it isn’t obvious how much faster it is compared to the base implementation. Here’s a more unabridged version of the script for full context.
Warming up -------------------------------------- | |
before 166.000 i/100ms | |
after 9.376k i/100ms | |
Calculating ------------------------------------- | |
before 1.797k (± 5.1%) i/s - 8.964k in 5.001974s | |
after 96.136k (± 8.0%) i/s - 478.176k in 5.010413s | |
Comparison: | |
after: 96135.8 i/s | |
before: 1796.9 i/s - 53.50x slower |
Running the script reveals a ~50x improvement to a simple order of 4 line items. With orders with more line items, the payoff only gets better.
One last thing to know about benchmarking effectively is being aware of micro-optimization. These are optimizations that are so small, the performance improvement isn’t worth the code change. While these are sometimes acceptable for hot code paths, it’s best to tackle larger scale performance issues first.
As with many open source projects, Ruby on Rails usually requires performance optimization pull requests to include benchmarks. The same is common for new features to performance sensitive areas like Active Record query building or Active Support’s cache stores. In the case of Rails, most benchmarks are made with benchmark-ips
to simplify comparison.
For example, https://github.com/rails/rails/pull/36052 changes how primary keys are accessed in Active Record instances. Specifically, refactoring class method calls to instance variable references. It includes before and after benchmark results with a clear explanation of why the change is necessary.
https://github.com/rails/rails/pull/38401 changes model attribute assignment in Active Record so that key stringification of attribute hashes is no longer needed. A benchmark script with multiple scenarios is provided with results. This is a particularly hot codepath because creating and updating records is at the heart of most Rails apps.
Another example, https://github.com/rails/rails/pull/34197 reduces object allocations in ActiveRecord#respond_to?
. It provides a memory benchmark that compares total allocations before and after the patch, with a calculated diff. Reducing allocations delivers better performance because the less Ruby allocates, the less time Ruby spends assigning objects to blocks of memory.
Slow code is an inevitable facet of any codebase. It isn’t important who introduces performance regressions, but how they are fixed. As developers, it’s our job to leverage profiling and benchmarking to find and fix performance problems.
At Shopify, we’ve written a lot of slow code, often for good reasons. Ruby itself is optimized for the developer, not the servers we run it on. As Rubyists, we write idiomatic, maintainable code that isn’t always performant, so profile and benchmark responsibly, and be wary of micro-optimizations!