Correlating Sentry with Envoy in a Service Mesh for End-to-End Observability of JavaScript Applications


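This investigation into a production incident began with an ordinary Sentry error report: Error: Request failed with status code 500. The stack trace pointed to an HTTP client in our Node.js service that had failed while calling the downstream service user-service. On the surface it was a typical application-layer error. The catch was that user-service's own Sentry project and metrics showed it running normally, with no errors and no added latency.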

// services/apiClient.js
import axios from 'axios';
import * as Sentry from '@sentry/node';

async function callUserService(userId) {
  const transaction = Sentry.getCurrentHub().getScope().getTransaction();
  const span = transaction?.startChild({ op: 'http.client', description: `GET /users/${userId}` });

  try {
    const response = await axios.get(`http://user-service.default.svc.cluster.local/users/${userId}`, {
      headers: {
        'Content-Type': 'application/json',
      },
    });
    span?.setStatus('ok');
    return response.data;
  } catch (error) {
    Sentry.captureException(error, {
      tags: { component: 'ApiClient' },
      extra: { userId },
    });
    span?.setStatus('internal_error');
    throw error; // Rethrow the original error
  } finally {
    span?.finish();
  }
}

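This code looks sound, and the Sentry integration is standard. Yet this is where Sentry's information reaches its limit. We could not answer several key questions:

Was the 500 returned by the user-service application itself, or was the request rejected before it ever got there?
Was the request terminated early by the mesh sidecar (Envoy) because of a network policy, a tripped circuit breaker, or an exhausted upstream connection pool?
How long did the request spend on the network, and how much latency did the Envoy proxies add?

In a service mesh, application code is only one part of a request's lifecycle. Envoy intercepts and manages the traffic, and the network layer is transparent to the application. For exactly that reason, when the problem lies in the network layer, application-level monitoring sees only part of the picture, like the blind men describing the elephant. Sentry knows what happened inside the application, Istio's telemetry knows what happened on the network, and a wide gap separates the two. Our goal is to bridge that gap by correlating Envoy's network telemetry precisely with Sentry's application error reports.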

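Initial Idea and Technology Choices

The most direct idea is to rely on distributed-tracing context propagation. With tracing enabled, Istio's Envoy sidecars generate and forward B3 propagation headers (such as x-b3-traceid and x-b3-spanid). Note that the Sentry Node.js SDK propagates its own sentry-trace header and, as far as we know, does not parse B3 headers out of the box, so reusing a B3 trace ID for a Sentry transaction takes explicit mapping. Once the IDs line up, the call chain is stitched together and can be viewed end to end in Jaeger or Zipkin. That still does not address our core pain point: when Sentry captures an exception, it remains an isolated application-layer event, and the error detail page shows none of the telemetry for the network request that caused it.

Correlating by trace ID alone is not enough. What we need is **data enrichment**: at some point before the Sentry event is sent, augment it with what the Envoy proxy observed for the same request.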
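
Enrichment still needs a single join key, whichever header carried the trace. The following is a minimal sketch of normalizing that key, assuming the sentry-trace header has the form traceId-spanId with an optional sampled flag and that B3 trace IDs are 16 or 32 hex characters. The helper name extractTraceId is hypothetical and not part of any SDK.

```javascript
// Sketch: derive a correlation trace ID from incoming request headers.
// `extractTraceId` is a hypothetical helper, not part of the Sentry SDK.

// sentry-trace format: <32 hex trace id>-<16 hex span id>[-<0|1> sampled flag]
const SENTRY_TRACE_RE = /^([0-9a-f]{32})-([0-9a-f]{16})(?:-([01]))?$/i;
// B3 trace IDs are 16 or 32 hex characters.
const B3_TRACE_ID_RE = /^(?:[0-9a-f]{16}|[0-9a-f]{32})$/i;

function extractTraceId(headers) {
  // Node.js lowercases incoming header names, so we look them up in lowercase.
  const sentryTrace = headers['sentry-trace'];
  if (typeof sentryTrace === 'string') {
    const match = SENTRY_TRACE_RE.exec(sentryTrace.trim());
    if (match) {
      return { source: 'sentry-trace', traceId: match[1].toLowerCase() };
    }
  }
  const b3 = headers['x-b3-traceid'];
  if (typeof b3 === 'string' && B3_TRACE_ID_RE.test(b3.trim())) {
    return { source: 'b3', traceId: b3.trim().toLowerCase() };
  }
  return null; // no usable trace header: the event cannot be correlated
}
```

Preferring sentry-trace over B3 is a design choice made for this sketch: it is the ID the Sentry event itself will carry, so it is the safer key when both headers are present.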

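There are several options:

In-app active lookup: in the catch block, the application uses the trace_id to query a telemetry backend (such as Prometheus or Loki) for network data. This is too invasive: it adds complexity and latency to the failure-handling path and depends heavily on the telemetry backend, which can easily turn one failure into a second one. Rejected.
Log aggregation and post-processing: ship both the application's Sentry events and Envoy's access logs (each carrying the trace_id) to a unified log platform (such as ELK or Splunk) and correlate them with queries afterwards. This works for analysis but does nothing for the Sentry alerting experience: an engineer who receives a Sentry alert still has to jump to another system and query by hand.
Custom Sentry Integration: write a Sentry Integration that uses the SDK's addGlobalEventProcessor hook. Before an event is sent, the processor takes the trace_id, fetches the matching Envoy data from a lightweight, highly available telemetry service, and attaches it to the event's tags and extra fields.

We chose option three. It puts the enriched data directly in the Sentry UI, giving developers the most immediate context, and it is the best way to fix the lack of information at the "scene of the incident". The approach has two technical parts: configuring Envoy to emit structured access logs containing the key fields, and implementing the custom Sentry Integration in Node.js.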

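Step-by-Step Implementation

1. Configure Envoy to Emit Structured Telemetry Logs

We need Istio's Envoy proxies to emit JSON access logs that contain the trace ID and rich network metadata. This can be done with an EnvoyFilter resource. In a real project, we would ship these logs through Fluentd or another log collector to central storage and expose a low-latency query interface on top.

Below is an example EnvoyFilter that applies to all workloads and configures a JSON access log format.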

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: "sentry-enrichment-access-log"
  namespace: istio-system
spec:
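  # No workloadSelector: in the root namespace (istio-system) the filter
  # applies to every workload in the mesh.
  configPatches:
    # The HTTP connection manager is a network filter, so patch NETWORK_FILTER.
    - applyTo: NETWORK_FILTER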
      match:
        context: SIDECAR_OUTBOUND # Or SIDECAR_INBOUND, GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: MERGE
        value:
          typed_config:
            "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
            access_log:
              - name: envoy.access_loggers.stdout
                typed_config:
                  "@type": "type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog"
                  log_format:
                    json_format:
                      trace_id: "%REQ(X-B3-TRACEID)%"
                      span_id: "%REQ(X-B3-SPANID)%"
                      sentry_trace: "%REQ(SENTRY-TRACE)%"
                      response_code: "%RESPONSE_CODE%"
                      response_flags: "%RESPONSE_FLAGS%"
                      bytes_sent: "%BYTES_SENT%"
                      bytes_received: "%BYTES_RECEIVED%"
                      duration_ms: "%DURATION%"
                      upstream_service_time_ms: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
                      upstream_host: "%UPSTREAM_HOST%"
                      upstream_cluster: "%UPSTREAM_CLUSTER%"
                      route_name: "%ROUTE_NAME%"
                      request_path: "%REQ(:PATH)%"
                      request_method: "%REQ(:METHOD)%"
                      user_agent: "%REQ(USER-AGENT)%"
                      protocol: "%PROTOCOL%"

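The key field here is response_flags. This powerful Envoy feature tells us the precise network-level reason a request failed. For example:

UC: Upstream connection termination
UO: Upstream overflow (circuit breaker tripped)
NR: No route configured (no matching route)
LH: Local service failed health check

The application layer cannot see any of this, yet it is the key to troubleshooting service mesh failures.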
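
To make these flags readable in an alert or log line, a small lookup table is usually enough. The sketch below covers only the four flags listed above (Envoy defines more) and assumes the logged value is either "-" for no flags or a comma-separated list. The helper name describeResponseFlags is hypothetical.

```javascript
// Sketch: turn Envoy's RESPONSE_FLAGS value into readable hints.
// Only the four flags discussed in the text are listed; Envoy defines more.
const RESPONSE_FLAG_HINTS = {
  UC: 'upstream connection termination',
  UO: 'upstream overflow (circuit breaking)',
  NR: 'no route configured',
  LH: 'local service failed health check',
};

// `describeResponseFlags` is a hypothetical helper name.
function describeResponseFlags(raw) {
  // Envoy logs "-" when no flag is set and separates multiple flags with commas.
  if (!raw || raw === '-') return [];
  return raw
    .split(',')
    .map(flag => flag.trim())
    .filter(Boolean)
    .map(flag => ({ flag, hint: RESPONSE_FLAG_HINTS[flag] ?? 'unknown flag' }));
}
```

For example, describeResponseFlags('UC,UO') yields two entries, one per flag, and any flag missing from the table is reported as unknown instead of being dropped.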

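2. Implement `EnvoyContextIntegration` in Node.js

Now for the core part: writing the custom Sentry Integration. Its job is to use the trace_id, before a Sentry event is sent, to look up the structured logs we just configured (assuming they have already been collected into a queryable service).

For the demo, we first create a mock telemetry query service. In the real world this could be an internal API that queries Loki, ClickHouse, or a small Redis cache.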

// telemetry-service-mock.js

// This is a mock service. In a real system, this would query a log aggregation
// platform like Loki, Elasticsearch, or a dedicated telemetry database.
const telemetryStore = new Map();

// Simulate Envoy writing a log
function logEnvoyTelemetry(data) {
  if (data.trace_id) {
    telemetryStore.set(data.trace_id, data);
  }
}

// The API our Sentry integration will call
async function getEnvoyTelemetryByTraceId(traceId) {
  if (!traceId) {
    return null;
  }
  // Simulate network delay and potential lookup failure
  await new Promise(resolve => setTimeout(resolve, 20)); // 20ms lookup latency
  
  const data = telemetryStore.get(traceId);
  if (data) {
    return data;
  }

  // A common issue: log ingestion delay. The telemetry might not be available yet.
  console.warn(`[TelemetryService] Telemetry for traceId ${traceId} not found.`);
  return null;
}

// Example of an incoming log from our Envoy proxy
logEnvoyTelemetry({
  trace_id: "a1b2c3d4e5f6a7b8",
  span_id: "b8a7b6c5d4e3f2a1",
  sentry_trace: "a1b2c3d4e5f6a7b8-b8a7b6c5d4e3f2a1-1",
  response_code: "503",
  response_flags: "UC",
  bytes_sent: "0",
  bytes_received: "57",
  duration_ms: "15",
  upstream_service_time_ms: "0",
  upstream_host: "10.0.1.23:8080",
  upstream_cluster: "outbound|8080||user-service.default.svc.cluster.local",
  route_name: "default",
  request_path: "/users/123",
  request_method: "GET",
  user_agent: "axios/0.21.4",
  protocol: "HTTP/1.1"
});

export { getEnvoyTelemetryByTraceId };
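
The mock above is preloaded by hand. As a rough sketch of what sits between Envoy's stdout and such a store, the following indexes JSON access-log lines by trace_id and expires them after a short TTL. The name createTelemetryIndex, its options, and the 60-second default are assumptions for illustration, not part of the original setup.

```javascript
// Sketch: index JSON access-log lines by trace ID with a bounded lifetime.
// `createTelemetryIndex` and its options are hypothetical names.
function createTelemetryIndex({ ttlMs = 60_000, now = Date.now } = {}) {
  const entries = new Map(); // traceId -> { record, expiresAt }

  return {
    // Parse one log line; returns true if it was indexed.
    ingestLine(line) {
      let record;
      try {
        record = JSON.parse(line);
      } catch {
        return false; // not JSON, e.g. a startup message from the proxy
      }
      const traceId = record && record.trace_id;
      if (typeof traceId !== 'string' || traceId === '' || traceId === '-') {
        return false; // the request carried no trace header
      }
      entries.set(traceId.toLowerCase(), { record, expiresAt: now() + ttlMs });
      return true;
    },

    // Return the record for a trace ID, or null if unknown or expired.
    get(traceId) {
      const key = String(traceId).toLowerCase();
      const entry = entries.get(key);
      if (!entry) return null;
      if (entry.expiresAt <= now()) {
        entries.delete(key);
        return null;
      }
      return entry.record;
    },
  };
}
```

Expired entries are only evicted when they are read, so a production version would also need a periodic sweep or a size cap to keep memory bounded.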

Next comes the star of the show, EnvoyContextIntegration.

// integrations/EnvoyContextIntegration.js
// getCurrentHub is passed in by Sentry as the second argument of setupOnce,
// so it does not need to be imported here.
import { getEnvoyTelemetryByTraceId } from '../telemetry-service-mock.js';

/**
 * A Sentry Integration that enriches events with Envoy proxy telemetry.
 * It correlates events using the trace ID, fetching network-level context
 * from a telemetry service.
 */
export class EnvoyContextIntegration {
  // Required for Sentry integrations
  static id = 'EnvoyContext';
  name = EnvoyContextIntegration.id;

  /**
   * Sets up the integration by adding a global event processor.
   * This processor intercepts every event before it's sent to Sentry.
   */
  setupOnce(addGlobalEventProcessor, getCurrentHub) {
    addGlobalEventProcessor(async (event, hint) => {
      const self = getCurrentHub().getIntegration(EnvoyContextIntegration);
      if (!self) {
        return event;
      }
      
      // We only care about events that are part of a transaction (i.e., a request)
      const scope = getCurrentHub().getScope();
      const span = scope.getSpan();
      
      if (!span) {
        return event;
      }

      const { traceId } = span.spanContext();

      if (!traceId) {
        console.warn('[EnvoyContextIntegration] Event has no traceId, cannot enrich.');
        return event;
      }
      
      try {
        // Here's the core logic: fetch telemetry data using the traceId.
        // A timeout is crucial here to avoid delaying the event submission indefinitely.
        const telemetry = await this.fetchTelemetryWithTimeout(traceId, 100);

        if (telemetry) {
          // If data is found, enrich the Sentry event.
          // 'tags' are indexed and searchable, ideal for critical flags.
          event.tags = {
            ...event.tags,
            'envoy.response_flags': telemetry.response_flags,
            'envoy.route_name': telemetry.route_name,
            'envoy.upstream_cluster': telemetry.upstream_cluster,
          };
          
          // 'extra' can hold more detailed, non-indexed information.
          event.extra = {
            ...event.extra,
            envoy_telemetry: telemetry,
          };
        }
      } catch (err) {
        // A common pitfall: the enrichment process itself must not fail the event reporting.
        console.error(`[EnvoyContextIntegration] Failed to enrich Sentry event: ${err.message}`);
      }

      return event;
    });
  }

  /**
   * A robust fetcher with a timeout. In a production system, you might also
   * want to add a circuit breaker or retry logic.
   * @param {string} traceId The trace ID to look up.
   * @param {number} timeoutMs The timeout in milliseconds.
   */
  async fetchTelemetryWithTimeout(traceId, timeoutMs) {
    return new Promise((resolve, reject) => {
      const timeoutId = setTimeout(() => {
        reject(new Error('Telemetry service lookup timed out'));
      }, timeoutMs);

      getEnvoyTelemetryByTraceId(traceId)
        .then(result => {
          clearTimeout(timeoutId);
          resolve(result);
        })
        .catch(err => {
          clearTimeout(timeoutId);
          reject(err);
        });
    });
  }
}

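A few design points in this Integration are worth noting:

Robustness: the telemetry lookup is wrapped in try...catch and bounded by a timeout. A failure in the enrichment logic must never interfere with normal error reporting. This rule is absolute.
Data structure: we use tags and extra deliberately. High-signal, low-cardinality fields such as response_flags go into tags so they can be searched and filtered in the Sentry UI, while the full telemetry record goes into extra for deeper investigation.
Context association: it relies on the current Hub and Scope, which the Sentry SDK tracks per request through Node.js async context (AsyncLocalStorage in recent SDK versions), to obtain the traceId, so the right request context is found even in asynchronous code.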
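
The robustness rule can be isolated into two small helpers that are easy to test without the Sentry SDK. This is a minimal sketch, not the integration's actual code: withTimeout and enrichSafely are hypothetical names, and the 100 ms default simply mirrors the timeout used in the integration.

```javascript
// Sketch: best-effort enrichment that can never block or break event delivery.
// `withTimeout` and `enrichSafely` are hypothetical helpers.
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('lookup timed out')), timeoutMs);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function enrichSafely(event, lookup, traceId, timeoutMs = 100) {
  try {
    // Wrapping the call in a promise chain also catches synchronous throws.
    const pending = Promise.resolve().then(() => lookup(traceId));
    const telemetry = await withTimeout(pending, timeoutMs);
    if (!telemetry) return event;
    return {
      ...event,
      tags: { ...event.tags, 'envoy.response_flags': telemetry.response_flags },
      extra: { ...event.extra, envoy_telemetry: telemetry },
    };
  } catch {
    return event; // lookup failed or timed out: send the original event unchanged
  }
}
```

Returning the original event object on every failure path keeps the contract simple: the caller always gets something it can send, enriched or not.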

3. Integrate into the Application

The last step is to initialize the Sentry SDK at application startup and register our custom Integration.

// server.js
import express from 'express';
import * as Sentry from '@sentry/node';
import * as Tracing from '@sentry/tracing';
import { EnvoyContextIntegration } from './integrations/EnvoyContextIntegration.js';

const app = express();

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  integrations: [
    // Our custom integration MUST be listed.
    new EnvoyContextIntegration(),
    // Sentry's default integrations for Node.js
    new Sentry.Integrations.Http({ tracing: true }),
    new Tracing.Integrations.Express({ app }),
  ],
  tracesSampleRate: 1.0,
});

// Sentry request handler must be the first middleware
app.use(Sentry.Handlers.requestHandler());
app.use(Sentry.Handlers.tracingHandler());

app.get('/user/:id', async (req, res, next) => {
  try {
    // This is a mock of the original failing API call.
    // We simulate a scenario where Envoy fails the request.
    const transaction = Sentry.getCurrentHub().getScope().getTransaction();
    const traceId = transaction?.spanContext().traceId;
    
    // In our test, we manually tie the traceId to the pre-logged telemetry.
    // In a real system, Istio would propagate this header automatically.
    if (traceId === "a1b2c3d4e5f6a7b8") {
       throw new Error('Simulated upstream connection failure');
    }

    res.status(200).send({ id: req.params.id, name: 'Test User' });
  } catch(error) {
    // Manually capture exception, as our integration works on captured events.
    Sentry.captureException(error, {
      extra: { userId: req.params.id }
    });
    // The error handler will take care of the response.
    next(error);
  }
});


// Sentry error handler must be before any other error middleware and after all controllers
app.use(Sentry.Handlers.errorHandler());

// Optional fallthrough error handler
app.use(function onError(err, req, res, next) {
  res.statusCode = 500;
  res.end(res.sentry + '\n');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

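When a request's traceId happens to be a1b2c3d4e5f6a7b8, the application throws. The Sentry SDK captures the exception and our EnvoyContextIntegration is triggered. It calls getEnvoyTelemetryByTraceId with that traceId, retrieves the telemetry we preloaded, and attaches it to the Sentry event. (Real Sentry trace IDs are 32 hexadecimal characters; the short ID here is purely for the demo.)

The event payload finally sent to Sentry looks like this (simplified):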

{
  "event_id": "...",
  "level": "error",
  "exception": { ... },
  "tags": {
    "component": "ApiClient",
    "envoy.response_flags": "UC",
    "envoy.route_name": "default",
    "envoy.upstream_cluster": "outbound|8080||user-service.default.svc.cluster.local"
  },
  "extra": {
    "userId": "123",
    "envoy_telemetry": {
      "trace_id": "a1b2c3d4e5f6a7b8",
      "response_code": "503",
      "response_flags": "UC",
      "duration_ms": "15",
      "upstream_host": "10.0.1.23:8080",
      ...
    }
  },
  "transaction": "GET /user/:id",
  "contexts": {
    "trace": {
      "op": "http.server",
      "trace_id": "a1b2c3d4e5f6a7b8",
      "span_id": "...",
    }
  }
}

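In the Sentry UI, the first thing an engineer sees is the tag envoy.response_flags: UC. They can conclude right away that this is not a logic error in application code but an upstream connection problem at the infrastructure level. They can go straight to checking user-service's network reachability, its Kubernetes Service definition, or network policies, instead of digging fruitlessly through application code.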

sequenceDiagram
    participant User
    participant EnvoySidecar as Envoy Sidecar (Caller)
    participant NodeApp as Node.js App
    participant SentrySDK as Sentry SDK w/ Integration
    participant TelemetrySvc as Telemetry Service
    participant SentryAPI as Sentry API

    User->>+EnvoySidecar: GET /user/123 (TraceID: T1)
    EnvoySidecar->>+NodeApp: GET /user/123 (Headers with T1)
    NodeApp->>SentrySDK: Start Transaction (TraceID: T1)
    Note over NodeApp: Throws Exception
    NodeApp->>SentrySDK: Sentry.captureException(error)
    SentrySDK->>SentrySDK: Event Processor Triggered
    SentrySDK->>+TelemetrySvc: getTelemetry(traceId=T1)
    TelemetrySvc-->>-SentrySDK: {response_flags: 'UC', ...}
    SentrySDK->>SentrySDK: Enrich event with Envoy data
    SentrySDK->>+SentryAPI: Send Enriched Event
    SentryAPI-->>-SentrySDK: OK
    NodeApp-->>-EnvoySidecar: 500 Internal Server Error
    EnvoySidecar-->>-User: 500 Internal Server Error

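Limitations and Future Outlook

This approach is powerful, but it is not free. First, it introduces a new dependency: the telemetry query service. Its availability and performance directly determine how well Sentry events are enriched. In production, the service must be highly available and its query latency very low.

Second, there is a data-consistency window. Collecting, processing, and indexing Envoy access logs takes time. If the corresponding log is not yet queryable when the Sentry event is reported, enrichment fails for that event. In practice, you can add bounded waiting and retry logic to the telemetry lookup, or accept that a small share of events will go out unenriched.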
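
A bounded retry for that window could look like the sketch below. It is an illustration under stated assumptions: lookupWithRetry and its option names are hypothetical, and the defaults (3 attempts, 30 ms apart, 150 ms overall) are placeholders, not tuned values.

```javascript
// Sketch: retry a lookup a few times to ride out log-ingestion lag,
// without ever waiting past an overall deadline. Names are hypothetical.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function lookupWithRetry(
  lookup,
  traceId,
  { attempts = 3, delayMs = 30, deadlineMs = 150 } = {},
) {
  const deadline = Date.now() + deadlineMs;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const result = await lookup(traceId);
    if (result) return result;
    // Stop if this was the last attempt or another wait would overrun the deadline.
    if (attempt === attempts || Date.now() + delayMs > deadline) break;
    await sleep(delayMs);
  }
  return null; // accept that this event goes out without enrichment
}
```

The deadline here only limits the waiting between attempts; each individual lookup should still carry its own timeout so that one slow call cannot stall event delivery.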

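A future optimization is to explore the OpenTelemetry Collector. It could act as a standardized middle layer that receives telemetry from both the application and Envoy. Inside the Collector, the batch and memory_limiter processors keep the pipeline efficient and bounded in memory, while the correlation and enrichment itself would need a dedicated processor before the data is forwarded to the Sentry backend. This moves the enrichment logic out of the application process and into a dedicated observability pipeline, making the overall architecture cleaner and more robust.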