How to Fix Slow Code in Ruby

何志文 2020-05-16

转自:https://engineering.shopify.com/blogs/engineering/how-fix-slow-code-ruby

At Shopify, we believe in highly aligned, loosely coupled teams to help us move fast. Since we have many teams working independently on a large monolithic Rails application, inefficiencies in code are sometimes inadvertently added to our codebase. Over time, these problems can add up to serious performance regressions.

By the time such performance regressions are noticeable, it might already be too late to track offending commits down. This can be exceedingly challenging on codebases with thousands of changes being committed each day. How do we effectively find out why our application is slow? Even if we have a fix for the slow code, how can we prove that our new code is faster?

It all starts with profiling and benchmarking. Last year, we wrote about writing fast code in Ruby on Rails. Knowing how to write fast code is useful, but insufficient without knowing how to fix slow code. Let’s talk about the approaches that we can use to find slow code, fix it, and prove that our new solution is faster. We’ll also explore some case studies that feature real world examples on using profiling and benchmarking.

Profiling

Before we dive into fixing unperformant code, we need to find it first. Identifying code that causes performance bottlenecks can be challenging in a large codebase. Profiling helps us to do so easily.

What is Profiling?

Profiling is a type of program analysis that collects metrics about the program at runtime, such as the frequency and duration of method calls. It’s carried out using a tool known as a profiler, and a profiler’s output can be visualized in various ways. For example, flat profiles, call graphs, and flamegraphs.

Why Should I Profile My Code?

Some issues are challenging to detect by just looking at the code (static analysis, code reviews, etc.). One of the main goals of profiling is observability. By knowing what is going on under the hood during runtime, we gain a better understanding of what the program is doing and reason about why an application is slow. Profiling helps us to narrow down the scope of a performance bottleneck to a particular area.

How Do I Profile?

Before we figure out what to profile, we need to first figure out what we want to know: do we want to measure elapsed time for a specific code block, or do we want to measure object allocations in that code block? In terms of granularity, do we need elapsed time for every single method call in that code block, or do we just need the aggregated value? Elapsed time here can be further broken down into CPU time or wall time.

For measuring elapsed time, a simple solution is to measure the start time and the end time of a particular code block, and report the difference. If we need a higher granularity, we do this for every single method. To do this, we use the TracePoint API in Ruby to hook into every single method call made by Ruby. Similarly, for object allocations, we use the ObjectSpace module to trace object allocations, or even dump the Ruby heap to observe its contents.

However, instead of building custom profiling solutions, we can use one of the available profilers out there, and each has its own advantages and disadvantages. Here are a few options:

1rbspy

rbspy samples stack frames from a Ruby process over time. The main advantage is that it can be used as a standalone program without needing any instrumentation code.

Once we know the Ruby Process Identifier (PID) that we want to profile, we start the profiling session like this:

rbspy record —pid $PID

2stackprof

Like rbspystackprof samples stack frames over time, but from a block of instrumented Ruby code. Stackprof is used as a profiling solution for custom code blocks:

profile = StackProf.run(mode: :cpu) do<br />  # Code to profile<br />end

3rack-mini-profiler

The rack-mini-profiler gem is a fully-featured profiling solution for Rack-based applications. Unlike the other profilers described in this section, it includes a memory profiler in addition to call-stack sampling. The memory profiler collects data such as Garbage Collection (GC) statistics, number of allocations, etc. Under the hood, it uses the stackprof and memory_profiler gems.

4app_profiler

app_profiler is a lightweight alternative to rack-mini-profiler. It contains a Rack-only middleware that supports call-stack profiling for web requests. In addition to that, block level profiling is also available to any Ruby application. These profiles can be stored in a configurable storage backend such as Google Cloud Storage, and can be visualized through a configurable viewer such as Speedscope, a browser-based flamegraph viewer.

At Shopify, we collect performance profiles in our production environments. Rack Mini Profiler is a great gem, but it comes with a lot of extra features such as database and memory profiling, and it seemed too heavy for our use case. As a result, we built App Profiler that similarly uses Stackprof under the hood. Currently, this gem is used to support our on-demand remote profiling infrastructure for production requests.

Case Study: Using App Profiler on Shopify

An example of a performance problem that was identified in production was related to unnecessary GC cycles. Last year, we noticed that a cart item with a very large quantity used a ridiculous amount of CPU time and resulted in slow requests. It turns out, the issue was related to Ruby allocating too many objects, triggering the GC multiple times.

The figure below illustrates a section of the flamegraph for a similar slow request, and the section corresponds to approximately 500ms of CPU time.

How to Fix Slow Code in Ruby

A section of the flamegraph for a similar slow request

The highlighted chunks correspond to the GC operations, and they interleave with the regular operations. From this section, we see that GC itself consumed about 35% of CPU time, which is a lot! We inferred that we were allocating too many Ruby objects. Without profiling, it’s difficult to identify these kinds of issues quickly.

Benchmarking

Now that we know how to identify performance problems, how do we fix them? While the right solution is largely context sensitive, validating the fix isn’t. Benchmarking helps us prove performance differences in two or more different code paths.

What is Benchmarking?

Benchmarking is a way of measuring the performance of code. Often, it’s used to compare two or more similar code paths to see which code path is the fastest. Here’s what a simple ruby benchmark looks like:

 # frozen_string_literal: true
  
 require "benchmark"
  
 def some_method
 sleep(1# pause execution for 1 second
 end
  
 seconds Benchmark.realtime some_method }
  
 puts "The method took #{seconds.truncate(2)} seconds."

This code snippet is benchmarking at its simplest. We’re measuring how long a method takes to run in seconds. We could extend the example to measure a series of methods, a complex math equation, or anything else that fits into a block. This kind of instrumentation is useful because it can unveil regression or improvement in speed over time.

While wall time is a pretty reliable measurement of “performance”, there’s other methods one can measure code by besides realtime, the Ruby standard library’s Benchmark module includes bm and bmbm.

The <a target="_blank" href="https://www.ancii.com/link/v1/GRiCsrE-wFG89Nq6imY-uskfXWHCR7XaD2hHq8PrpRrTTdM5yCPTs1Z0Yng8HatlgLepaoqEHOmH1QqmYyP2xzOXCOkQ_2tEccb2Wls7noO1uGdhv9sW-XWyCtdxVoTD/" rel="nofollow" title="bm">bm</a> method shows a more detailed breakdown of timing measurements. Let’s take a look at a script with some output:

 # frozen_string_literal: true
  
 require "benchmark"
  
 def some_method
 sleep(1)
 end
  
 Benchmark.bm do |x|
 x.report("method"some_method }
 end
 user system total real
 method 0.000040 0.000030 0.000070 ( 1.002846)

User, system, and total are all different measurements of CPU time. User refers to time spent working in user space. Similarly, system denotes time spent working in kernel space. Total is the sum of CPU timings, and real is the same wall time measurement we saw from Benchmark.realtime.


What about <a target="_blank" href="https://www.ancii.com/link/v1/GRiCsrE-wFG89Nq6imY-uskfXWHCR7XaD2hHq8PrpRrTTdM5yCPTs1Z0Yng8HatlgLepaoqEHOmH1QqmYyP2xzOXCOkQ_2tEccb2Wls7noNkfB3jOgtohI7npCL1Qg0P/" rel="nofollow" title="bmbm">bmbm</a>? Well, it is exactly the same as bm with one unique difference. Here’s what the output looks like:

 Rehearsal ------------------------------------------
 method 0.000047 0.000035 0.000082 ( 1.005247)
 --------------------------------- total: 0.000082sec
  
 user system total real
 method 0.000033 0.000037 0.000070 ( 1.003529)

The rehearsal, or warmup step is what makes bmbm useful. It runs benchmark code blocks once before measuring to prime any caching or similar mechanism to produce more stable, reproducible results.

Lastly, let’s talk about the <a target="_blank" href="https://www.ancii.com/link/v1/Dl4zefi9Yghr3VWBBEc_5lg_33trlSQTy6Xs7W9hm3sBiL8Lt6JxNJcM_w314qo9/" rel="nofollow" title="benchmark-ips">benchmark-ips</a> gem. This is the most common method of benchmarking Ruby code. You’ll see it a lot in the wild, this is what a simple script looks like:

 # frozen_string_literal: true
  
 require "bundler/inline"
  
 gemfile(truedo
 source "https://rubygems.org"
 gem "benchmark-ips"
 end
  
 require "benchmark/ips"
  
 def some_method
 sleep(1)
 end
  
 Benchmark.ips do |x|
 x.report("method"some_method }
 end

Here, we’re benchmarking the same method using familiar syntax with ips method. Notice the inline bundler and gemfile code. We need this in a scripting context because benchmark-ips isn’t part of the standard library. In a normal project setup, we add gem entries to the Gemfile as usual.

The output of this script is as follows:

 Warming up --------------------------------------
 method 1.000 i/100ms
 Calculating -------------------------------------
 method 0.997 (± 0.0%) i/s - 5.000 in 5.015979s

Ignoring the bundler output, we see the warmup iteration score per 100 milliseconds ran for the default of 2 seconds, and how many times the code block was able to run in 5 seconds. It’ll become more apparent why benchmark-ips is so popular later.

Why Should I Benchmark My Code?

So, now we know what benchmarking is and some tools available to us. But why even bother benchmarking at all? It may not be immediately obvious why benchmarking is so valuable.

Benchmarks are used to quantify the performance of one or more blocks of code. This becomes very useful when there are performance questions that need answers. Often, these questions boil down to “which is faster, A or B?”. Let’s look at an example:

 # frozen_string_literal: true
  
 require "bundler/inline"
  
 gemfile(truedo
 source "https://rubygems.org"
 gem "benchmark-ips"
 end
  
 require "benchmark/ips"
  
 def some_method
 sleep(1)
 end
  
 def other_method
 sleep(1.1)
 end
  
 Benchmark.ips do |x|
 x.report("some method"some_method }
 x.report("other method"other_method }
  
 x.compare!
 end

In this script, we’re doing most of what we did in the first benchmark-ips example. Pay attention to the addition of another method, and how it changes the benchmark block. When benchmarking more than one thing at once, simply add another report block. Additionally, the compare! method prints a comparison of all reports:

 Warming up --------------------------------------
 some method 1.000 i/100ms
 other method 1.000 i/100ms
 Calculating -------------------------------------
 some method 0.997 (± 0.0%) i/s - 5.000 in 5.013651s
 other method 0.907 (± 0.0%) i/s - 5.000 in 5.513106s
  
 Comparison:
 some method: 1.0 i/s
 other method: 0.9 i/s - 1.10x slower

Wow, that’s pretty snazzy! compare! is able to tell us which benchmark is slower and by how much. Given the amount of thread sleeping we’re doing in our benchmark subject methods, this aligns with our expectations.

Benchmarking can be a means of proving how fast a given code path is. It’s not uncommon for developers to propose a code change that makes a code path faster without any evidence.

Depending on the change, comparison can be challenging. As in the previous example, benchmark-ips may be used to benchmark individual code paths. Running the same single report benchmark on versions of code easily tests pre and post patch performance.

How Do I Benchmark My Code?

Now we know what benchmarking is and why it is important. Great! But how do you get started benchmarking in an application? Trivial examples are easy to learn from but aren’t very relatable.

When developing in a framework like Ruby on Rails, it can be difficult to understand how to set up and load framework code for benchmark scripts. Thankfully, one of the latest features of Ruby on Rails can generate benchmarks automatically. Let’s take a look:

 # frozen_string_literal: true
  
 require_relative "../../config/environment"
  
 # Any benchmarking setup goes here...
  
  
  
 Benchmark.ips do |x|
 x.report("before"}
 x.report("after"}
  
 x.compare!
 end

This benchmark can be generated by running bin/rails generate benchmark my_benchmark, placing a file in script/benchmarks/my_benchmark.rb. Note the inline gemfile isn’t required because we piggyback off of the Rails app’s Gemfile. The benchmark generator is slated for release in Rails 6.1.

Now, let’s look at a real world example of a Rails benchmark:

 # frozen_string_literal: true
  
 require_relative "../../config/environment"
  
 class OrderWithCachedPrice < Order
 def price
 Rails.cache.fetch("order_#{id}_total"super }
 end
 end
  
 order Order.first_or_create!
 order.line_items.create!(name: "Rocket Shoes"price: 97.98)
 order.line_items.create!(name: "Invisible Ink"price: 5.67)
 order.line_items.create!(name: "Ray Gun"price: 1_235.97)
 order.line_items.create!(name: "Chunky Bacon"price: 8.99)
 order_with_cache OrderWithCachedPrice.first
  
 Benchmark.ips do |x|
 x.report("before"order.price }
 x.report("after"order_with_cache.price }
  
 x.compare!
 end

In this example, we’re subclassing Order and caching the calculation it does to find the total price of all line items. While it may seem obvious that this would be a beneficial code change, it isn’t obvious how much faster it is compared to the base implementation. Here’s a more unabridged version of the script for full context.

 Warming up --------------------------------------
 before 166.000 i/100ms
 after 9.376k i/100ms
 Calculating -------------------------------------
 before 1.797k (± 5.1%) i/s - 8.964k in 5.001974s
 after 96.136k (± 8.0%) i/s - 478.176k in 5.010413s
  
 Comparison:
 after: 96135.8 i/s
 before: 1796.9 i/s - 53.50x slower

Running the script reveals a ~50x improvement to a simple order of 4 line items. With orders with more line items, the payoff only gets better.

One last thing to know about benchmarking effectively is being aware of micro-optimization. These are optimizations that are so small, the performance improvement isn’t worth the code change. While these are sometimes acceptable for hot code paths, it’s best to tackle larger scale performance issues first.

Case Study: Rails Contributions

As with many open source projects, Ruby on Rails usually requires performance optimization pull requests to include benchmarks. The same is common for new features to performance sensitive areas like Active Record query building or Active Support’s cache stores. In the case of Rails, most benchmarks are made with benchmark-ips to simplify comparison.

For example, https://github.com/rails/rails/pull/36052 changes how primary keys are accessed in Active Record instances. Specifically, refactoring class method calls to instance variable references. It includes before and after benchmark results with a clear explanation of why the change is necessary.

https://github.com/rails/rails/pull/38401 changes model attribute assignment in Active Record so that key stringification of attribute hashes is no longer needed. A benchmark script with multiple scenarios is provided with results. This is a particularly hot codepath because creating and updating records is at the heart of most Rails apps.

Another example, https://github.com/rails/rails/pull/34197 reduces object allocations in ActiveRecord#respond_to?. It provides a memory benchmark that compares total allocations before and after the patch, with a calculated diff. Reducing allocations delivers better performance because the less Ruby allocates, the less time Ruby spends assigning objects to blocks of memory.

Final Thoughts

Slow code is an inevitable facet of any codebase. It isn’t important who introduces performance regressions, but how they are fixed. As developers, it’s our job to leverage profiling and benchmarking to find and fix performance problems.

At Shopify, we’ve written a lot of slow code, often for good reasons. Ruby itself is optimized for the developer, not the servers we run it on. As Rubyists, we write idiomatic, maintainable code that isn’t always performant, so profile and benchmark responsibly, and be wary of micro-optimizations!

Additional Information

相关推荐