缓存缓存

Cache meme logo

什么是Http缓存?

通过网络请求较大的资源文件可能会比较慢, 也会导致很多流量的开销, 通过http缓存让客户端在一定的规则内使用上次请求返回的结果, 减少用户的等待时间, 也减少服务器压力.

问题现象

过了2天语音助手咋还是老版本没更新?
终于找到一台是老版本的设备, 连接上chrome inspect一下, 为什么就直接更新了?
有些设备马上就可以看到更新了。

如何解决?

1 抓包

页面没有更新, 说明访问的html文件没有获取到最新的, 那就猜测是不是请求html时返回了304导致了问题. 但抓包结果是, 连html页面的请求都没有. 还有一个神奇的现象是, 把手机的网络给断了之后, 仍然能够加载显示我们的网页，难道不是应该显示无法加载的error页面吗? WTF?

2 Google了解问题

然后就只能去Google了, 首先查到的是Google的Web Fundamentals教程里的说明:

cache google

When the server returns a response, it also emits a collection of HTTP headers, describing its content-type, length, caching directives, validation token, and more. For example, in the above exchange, the server returns a 1024-byte response, instructs the client to cache it for up to 120 seconds, and provides a validation token (“x234dff”) that can be used after the response has expired to check if the resource has been modified.

然后查看我们在请求https://eq.10jqka.com.cn/ai/pubrobot/时的response的信息, 发现并没有携带max-age, 只有一个etag参数.

问了资深前端开发, 很少会有不设置max-age只设置etag的情况. 所以问题变成了如果在请求response里面只带了etag缓存会怎么样? 那只有etag的话, 难道不是应该每一次核对etag在更新时候更新etag后能够实时更新用户端的代码呢? 答案当然不是和我们想的一样.

3 Stackoverflow找到类似的问题

https://stackoverflow.com/questions/14345898/what-heuristics-do-browsers-use-to-cache-resources-not-explicitly-set-to-be-cach/

Stackoverflow上有人在提问, 如果请求的资源没有显式地声音缓存的策略, 浏览器一般会怎么样响应? 并提供了如下的RFC文档,

13.2.2 Heuristic Expiration

Since origin servers do not always provide explicit expiration times, HTTP caches typically assign heuristic expiration times, employing algorithms that use other header values (such as the Last-Modified time) to estimate a plausible expiration time. The HTTP/1.1 specification does not provide specific algorithms, but does impose worst-case constraints on their results. Since heuristic expiration times might compromise semantic transparency, they ought to used cautiously, and we encourage origin servers to provide explicit expiration times as much as possible. HTTP/1.1 RFC 2616

里面说如果没有显式的缓存失效时间, 会使用一个启发式的时间, 但这个启发式的时间又是什么? 问题下面给出了公式.

return (creationTime - lastModifiedValue) * 0.1;

4 Webkit源码的解释

chromium 源码

enum ValidationType {
  VALIDATION_NONE,          // The resource is fresh.
  VALIDATION_ASYNCHRONOUS,  // The resource requires async revalidation.
  VALIDATION_SYNCHRONOUS    // The resource requires sync revalidation.
};

首先是三个常量, 无需验证, 异步验证, 同步验证.

// From RFC 2616 section 13.2.4:
//
// The calculation to determine if a response has expired is quite simple:
//
//   response_is_fresh = (freshness_lifetime > current_age)
//
// Of course, there are other factors that can force a response to always be
// validated or re-fetched.
//
// From RFC 5861 section 3, a stale response may be used while revalidation is
// performed in the background if
//
//   freshness_lifetime + stale_while_revalidate > current_age
//
ValidationType HttpResponseHeaders::RequiresValidation(
    const Time& request_time,
    const Time& response_time,
    const Time& current_time) const {
  FreshnessLifetimes lifetimes = GetFreshnessLifetimes(response_time);
  if (lifetimes.freshness.is_zero() && lifetimes.staleness.is_zero())
    return VALIDATION_SYNCHRONOUS;

  TimeDelta age = GetCurrentAge(request_time, response_time, current_time);

  if (lifetimes.freshness > age)
    return VALIDATION_NONE;

  if (lifetimes.freshness + lifetimes.staleness > age)
    return VALIDATION_ASYNCHRONOUS;

  return VALIDATION_SYNCHRONOUS;
}

浏览器是否要去服务端请求, 取决于当前的资源是否是fresh, 而是否fresh的判断是比较资源的设置的新鲜度freshness和资源当前的存活时间age大小来决定的.

简单来说就是freshness越大, 缓存的时间越久

freshness的值是根据资源返回的缓存策略cache-control头, 以及浏览器本地的设置决定的. 当freshness大于age, 说明目前存活时间还是新鲜的, 不需要请求服务器.

age的计算因为涉及到一些修正计算, 大致为上一次资源从服务器返回到浏览器的时间戳减去服务器生成这个资源的时间戳. 也就是:

apparent_age = response_time(浏览器) - date_value(服务器)

这个freshness的计算规则就在GetFreshnessLifetimes中.

// From RFC 2616 section 13.2.4:
//
// The max-age directive takes priority over Expires, so if max-age is present
// in a response, the calculation is simply:
//
//   freshness_lifetime = max_age_value
//
// Otherwise, if Expires is present in the response, the calculation is:
//
//   freshness_lifetime = expires_value - date_value
//
// Note that neither of these calculations is vulnerable to clock skew, since
// all of the information comes from the origin server.
//
// Also, if the response does have a Last-Modified time, the heuristic
// expiration value SHOULD be no more than some fraction of the interval since
// that time. A typical setting of this fraction might be 10%:
//
//   freshness_lifetime = (date_value - last_modified_value) * 0.10
//
// If the stale-while-revalidate directive is present, then it is used to set
// the |staleness| time, unless it overridden by another directive.
//
HttpResponseHeaders::FreshnessLifetimes
HttpResponseHeaders::GetFreshnessLifetimes(const Time& response_time) const {
  FreshnessLifetimes lifetimes;
  // Check for headers that force a response to never be fresh.  For backwards
  // compat, we treat "Pragma: no-cache" as a synonym for "Cache-Control:
  // no-cache" even though RFC 2616 does not specify it.
  if (HasHeaderValue("cache-control", "no-cache") ||
      HasHeaderValue("cache-control", "no-store") ||
      HasHeaderValue("pragma", "no-cache")) {
    return lifetimes;
  }

  // Cache-Control directive must_revalidate overrides stale-while-revalidate.
  bool must_revalidate = HasHeaderValue("cache-control", "must-revalidate");

  if (must_revalidate || !GetStaleWhileRevalidateValue(&lifetimes.staleness)) {
    DCHECK_EQ(TimeDelta(), lifetimes.staleness);
  }

  // NOTE: "Cache-Control: max-age" overrides Expires, so we only check the
  // Expires header after checking for max-age in GetFreshnessLifetimes.  This
  // is important since "Expires: <date in the past>" means not fresh, but
  // it should not trump a max-age value.
  if (GetMaxAgeValue(&lifetimes.freshness))
    return lifetimes;

  // If there is no Date header, then assume that the server response was
  // generated at the time when we received the response.
  Time date_value;
  if (!GetDateValue(&date_value))
    date_value = response_time;

  Time expires_value;
  if (GetExpiresValue(&expires_value)) {
    // The expires value can be a date in the past!
    if (expires_value > date_value) {
      lifetimes.freshness = expires_value - date_value;
      return lifetimes;
    }

    DCHECK_EQ(TimeDelta(), lifetimes.freshness);
    return lifetimes;
  }
  
  ...
  
  if ((response_code_ == 200 || response_code_ == 203 ||
       response_code_ == 206) &&
      !must_revalidate) {
    // TODO(darin): Implement a smarter heuristic.
    Time last_modified_value;
    if (GetLastModifiedValue(&last_modified_value)) {
      // The last-modified value can be a date in the future!
      if (last_modified_value <= date_value) {
        lifetimes.freshness = (date_value - last_modified_value) / 10;
        return lifetimes;
      }
    }
  }

  // These responses are implicitly fresh (unless otherwise overruled):
  if (response_code_ == 300 || response_code_ == 301 || response_code_ == 308 ||
      response_code_ == 410) {
    lifetimes.freshness = TimeDelta::Max();
    lifetimes.staleness = TimeDelta();  // It should never be stale.
    return lifetimes;
  }

  // Our heuristic freshness estimate for this resource is 0 seconds, in
  // accordance with common browser behaviour. However, stale-while-revalidate
  // may still apply.
  DCHECK_EQ(TimeDelta(), lifetimes.freshness);
  return lifetimes;
}

上面描述了计算freshness的方法:

date_value为header中标识服务器资源到达浏览器的时间戳.

如果cache-control中不需要缓存的值no-cache, no-store等, 直接返回0.
判断了must-revalidate.
cache-control中是否含有max-age设置, 有的话直接返回max-age值.
如果有expires的设置, expires_value - date_value就是新鲜度
如果资源上一次请求返回的是200等成功值, 新鲜度的值就为

lifetimes.freshness = (date_value - last_modified_value) / 10;

这个就是我们这次问题的关键原因, 当我们在服务器上的资源很久没有更新(last_modified_value很久远), 用户在新访问(date_value比较新)这个资源后, 会导致freshness特别大, 也就导致缓存的有效时间就越久.

比如我们资源是在7月11日更新(1562860767210), 用户在7月21日访问(1563724767210)后, freshness的值就为1天.

5. etag的用处

上面的源码中貌似没有etag的出现, 那etag是做什么的?

https://stackoverflow.com/questions/824152/what-takes-precedence-the-etag-or-last-modified-http-header/1560098

Note that the Client does not check the headers to see if they have changed; it just blindly uses them in the next conditional request; it is up to the Server to evaluate whether to send the requested content or a 304 Not Modified response. If the Server only sends one, then the Client will use that one alone (although, only strong validators are useful for a Range request). Of course, it is also at the discretion of intermediate caches (unless they have been prevented from caching via Cache Control directives) and the Server as to how they will act upon the headers; the RFC states that they MUST NOT return a 304 Not Modified if the validators are inconsisent, but since the header values are generated by the server, it has quite a bit of leeway.

意思是浏览器只是把etag在下一次该请求的时候回传给服务器, 到底返回什么内容(200 or 304), 是服务器来决定的. 而这个该请求的时机就是上面缓存的计算的规则.