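Background

A user recently reported a problem with Redis traffic statistics: the outbound traffic Redis reported was larger than what the client actually observed. The chart below shows the Redis outbound traffic collected by our backend monitoring, in KByte per second. According to the Redis core's own statistics, outbound traffic is about 10 MB per second.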

How do you fix a Redis traffic statistics problem? Alibaba Cloud engineers on the analysis and the fix

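Our backend Tianxiang (天象) system measures the traffic of each Redis instance at the protocol-stack level. Its chart for the same moment is shown below: outbound traffic is about 2 MB per second, which does not match the figure Redis reports.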

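How Redis Traffic Statistics Work

The Redis outbound traffic collected by backend monitoring is the instantaneous_output_kbps value returned by the info command. That value is computed as: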

(float)getInstantaneousMetric(STATS_METRIC_NET_OUTPUT)/1024

getInstantaneousMetric is implemented as follows:

/* Return the mean of all the samples. */
long long getInstantaneousMetric(int metric) {
    int j;
    long long sum = 0;

    for (j = 0; j < STATS_METRIC_SAMPLES; j++)
        sum += server.inst_metric[metric].samples[j];
    return sum / STATS_METRIC_SAMPLES;
}

So the outbound traffic is the mean of the samples stored in server.inst_metric for the given metric type. Those samples are written by trackInstantaneousMetric, implemented as follows:

/* Add a sample to the operations per second array of samples. */
void trackInstantaneousMetric(int metric, long long current_reading) {
    long long t = mstime() - server.inst_metric[metric].last_sample_time;
    long long ops = current_reading -
                    server.inst_metric[metric].last_sample_count;
    long long ops_sec;

    ops_sec = t > 0 ? (ops*1000/t) : 0;

    server.inst_metric[metric].samples[server.inst_metric[metric].idx] =
        ops_sec;
    server.inst_metric[metric].idx++;
    server.inst_metric[metric].idx %= STATS_METRIC_SAMPLES;
    server.inst_metric[metric].last_sample_time = mstime();
    server.inst_metric[metric].last_sample_count = current_reading;
}

trackInstantaneousMetric is called periodically from serverCron:

run_with_period(100) {
    trackInstantaneousMetric(STATS_METRIC_COMMAND,server.stat_numcommands);
    trackInstantaneousMetric(STATS_METRIC_NET_INPUT,
            server.stat_net_input_bytes);
    trackInstantaneousMetric(STATS_METRIC_NET_OUTPUT,
            server.stat_net_output_bytes);
}
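The sampling scheme described so far can be sketched in isolation. The following is a minimal standalone model, not the Redis source: the names rate_metric, rate_track, and rate_mean are hypothetical, the clock is passed in as a parameter rather than read from mstime(), and the sample count of 16 is an assumption standing in for STATS_METRIC_SAMPLES.

```c
#include <stdio.h>

/* Assumption: stands in for STATS_METRIC_SAMPLES. */
#define SAMPLES 16

/* Hypothetical stand-in for one entry of server.inst_metric. */
typedef struct {
    long long last_time_ms;   /* time of the previous sample */
    long long last_count;     /* counter value at the previous sample */
    long long samples[SAMPLES];
    int idx;                  /* next slot to overwrite in the ring */
} rate_metric;

/* Store the per-second rate observed since the previous sample. */
void rate_track(rate_metric *m, long long now_ms, long long counter) {
    long long t = now_ms - m->last_time_ms;
    long long delta = counter - m->last_count;

    m->samples[m->idx] = t > 0 ? delta * 1000 / t : 0;
    m->idx = (m->idx + 1) % SAMPLES;
    m->last_time_ms = now_ms;
    m->last_count = counter;
}

/* Mean of all stored samples, in counter units per second. */
long long rate_mean(const rate_metric *m) {
    long long sum = 0;
    for (int j = 0; j < SAMPLES; j++) sum += m->samples[j];
    return sum / SAMPLES;
}
```

If the counter grows by 102400 bytes every 100 ms, every sample is 1024000 bytes per second, so the mean divided by 1024 gives 1000. The point of the model is that the reported rate is only as accurate as the counter fed into it: any over-counting in the byte counter shows up directly in the rate.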

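From the functions above, the traffic figure is an average obtained by periodically sampling server.stat_net_output_bytes. The key to the Redis outbound traffic calculation is therefore how server.stat_net_output_bytes itself is updated. The code in the Redis core that updates it is as follows: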

/* Return true if the specified client has pending reply buffers to write to
 * the socket. */
int clientHasPendingReplies(client *c) {
    return c->bufpos || listLength(c->reply);
}

/* Write data in output buffers to client. Return C_OK if the client
 * is still valid after the call, C_ERR if it was freed. */
int writeToClient(int fd, client *c, int handler_installed) {
    ssize_t nwritten = 0, totwritten = 0;
    size_t objlen;
    size_t objmem;
    robj *o;

    while(clientHasPendingReplies(c)) {
        if (c->bufpos > 0) {
            nwritten = write(fd,c->buf+c->sentlen,c->bufpos-c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            totwritten += nwritten;

            /* If the buffer was sent, set bufpos to zero to continue with
             * the remainder of the reply. */
            if ((int)c->sentlen == c->bufpos) {
                c->bufpos = 0;
                c->sentlen = 0;
            }
        } else {
            o = listNodeValue(listFirst(c->reply));
            objlen = sdslen(o->ptr);
            objmem = getStringObjectSdsUsedMemory(o);

            if (objlen == 0) {
                listDelNode(c->reply,listFirst(c->reply));
                c->reply_bytes -= objmem;
                continue;
            }

            nwritten = write(fd, ((char*)o->ptr)+c->sentlen,objlen-c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            totwritten += nwritten;

            /* If we fully sent the object on head go to the next one */
            if (c->sentlen == objlen) {
                listDelNode(c->reply,listFirst(c->reply));
                c->sentlen = 0;
                c->reply_bytes -= objmem;
            }
        }
        /* */
        server.stat_net_output_bytes += totwritten;
        if (totwritten > NET_MAX_WRITES_PER_EVENT &&
            (server.maxmemory == 0 ||
             zmalloc_used_memory() < server.maxmemory)) break;
    }
    if (nwritten == -1) {
        if (errno == EAGAIN) {
            nwritten = 0;
        } else {
            serverLog(LL_VERBOSE,
                "Error writing to client: %s", strerror(errno));
            freeClient(c);
            return C_ERR;
        }
    }
    if (totwritten > 0) {
        /* */
        if (!(c->flags & CLIENT_MASTER)) c->lastinteraction = server.unixtime;
    }
    if (!clientHasPendingReplies(c)) {
        c->sentlen = 0;
        if (handler_installed) aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);

        if (c->flags & CLIENT_CLOSE_AFTER_REPLY) {
            freeClient(c);
            return C_ERR;
        }
    }
    return C_OK;
}

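A careful reading of this code shows the problem. totwritten is a running total: it grows on every pass through the while loop and is never reset. Yet server.stat_net_output_bytes is incremented by totwritten inside the loop, so when the loop runs more than once, each pass adds the bytes from all earlier passes again. The counter re-counts data it has already counted, and the outbound traffic is overstated. Moving the server.stat_net_output_bytes update out of the while loop fixes the statistic. After patching the Redis core accordingly, we checked the monitoring chart again (shown below): the monitored values are now essentially the same as those collected by Tianxiang.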
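The over-counting can be reproduced without any sockets. The sketch below is a simplified model rather than the actual patch: count_inside_loop and count_after_loop are hypothetical helpers, and the array of write sizes stands in for the values successive write() calls would return within one writeToClient call.

```c
#include <stddef.h>

/* Buggy pattern: the running total is added to the counter on every
 * iteration, so bytes from earlier iterations are counted again. */
long long count_inside_loop(const long long *writes, size_t n) {
    long long stat = 0;        /* models server.stat_net_output_bytes */
    long long totwritten = 0;  /* running total, never reset */

    for (size_t i = 0; i < n; i++) {
        totwritten += writes[i];
        stat += totwritten;
    }
    return stat;
}

/* Fixed pattern: the total is added to the counter once, after the loop. */
long long count_after_loop(const long long *writes, size_t n) {
    long long stat = 0;
    long long totwritten = 0;

    for (size_t i = 0; i < n; i++)
        totwritten += writes[i];
    stat += totwritten;
    return stat;
}
```

For three writes of 100, 200, and 300 bytes, the buggy version reports 100 + 300 + 600 = 1000 bytes while only 600 were sent. With a single write per call the two versions agree, which is why the error only appears for replies large enough to need several passes through the loop.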

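Summary

Resource limits on the cloud database do not rely on the value of server.stat_net_output_bytes, so they are not affected by the traffic miscalculation in the stock Redis core. We have submitted a pull request for this issue to antirez and are waiting for upstream to confirm and merge the fix. Alibaba Cloud Redis is committed to providing the best cloud Redis service, and we are looking for like-minded people to join us. If you are interested, follow this link: https://job.alibaba.com/zhaopin/position_detail.htm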


