Building BDD-Driven Frontend Observability: A Retrospective on Integrating Apollo Client with the ELK Stack

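Messy frontend logging is a long-standing pain point. console.log calls scattered across the codebase, error messages with no context, and "ghost" issues that cannot be reproduced outside production all turn troubleshooting into a nightmare. Logging is often treated as a secondary, after-the-fact task, so its quality is uneven, and it can be broken by accident during a refactor. When an incident hits production, what we need most is high-quality, structured telemetry, yet what we usually have is a handful of scattered strings that cannot be correlated.

The root cause is that we treat observability as an "implementation detail" rather than a "core feature". If a feature matters, we write tests for it. So can we test our logging and monitoring behavior the same way we test business logic?

This idea led to a new practice: applying the principles of Behavior-Driven Development (BDD) to frontend observability. Instead of merely "logging an error", we define an explicit behavior: "When a user performs an action and hits a specific type of failure, the system must emit a structured log containing specific context (such as a correlation ID, user identifier, and operation name), and that log must reach our central logging system."

This behavior can be described and verified, which in turn guides our technology choices.

In our stack, the frontend uses React and Apollo Client to talk to a GraphQL backend, and the log aggregation platform is the mature ELK Stack (Elasticsearch, Logstash, Kibana).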

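BDD (Behavior-Driven Development): We chose Gherkin syntax to describe observability scenarios. It provides a shared language that lets developers, SREs, and even product teams understand and sign off on key telemetry behavior. We use Cucumber.js as the test runner.
Apollo Client: Its middleware system, ApolloLink, is a natural entry point for this idea. We can create a custom Link that intercepts every GraphQL request and response and acts as the central hub for generating and sending telemetry, without touching any business component code.
ELK Stack: As the final destination for logs, it must also be part of the test loop. The BDD tests not only trigger logs but also query Elasticsearch afterwards to verify that each log was indexed in the expected format.

The core of this approach is to turn observability from a passive, unreliable side effect into an active, first-class feature backed by a test suite.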

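Step 1: Set Up a Local Test Environment

To close the test loop we need a complete, lightweight local ELK environment. Docker Compose is the most direct option.

Configuration is the key here. Logstash's HTTP input plugin must be able to receive logs from the frontend, and the Elasticsearch port must be reachable from the BDD test runner so that it can verify results.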

docker-compose.yml:

version: '3.7'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.5.3
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false # Disable security features in the test environment to simplify connections
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
    networks:
      - elk

  logstash:
    image: docker.elastic.co/logstash/logstash:8.5.3
    container_name: logstash
    ports:
      - "5044:5044"
      - "5000:5000/tcp" # Expose the HTTP input port
      - "9600:9600"
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
      - ./logstash/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    depends_on:
      - elasticsearch
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:8.5.3
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    depends_on:
      - elasticsearch
    networks:
      - elk

networks:
  elk:
    driver: bridge

The Logstash pipeline configuration is the bridge between the frontend and Elasticsearch. We use the http input plugin to listen on port 5000 and the json codec to parse the incoming log body.

logstash/pipeline/logstash.conf:

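input {
  # Listen for HTTP POST requests from the frontend application
  http {
    port => 5000
    codec => "json"
    # Add CORS headers that allow requests from any origin, which is needed for local development.
    # In production, restrict this to your application's domain.
    # The http input plugin sets custom response headers through `response_headers`.
    response_headers => {
      "Access-Control-Allow-Origin" => "*"
      "Access-Control-Allow-Methods" => "POST, OPTIONS"
      "Access-Control-Allow-Headers" => "Content-Type"
    }
  }
}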

filter {
  # Data cleansing, transformation, or enrichment can go here
  mutate {
    remove_field => ["http_version", "headers", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "frontend-logs-%{+YYYY.MM.dd}" # Create one index per day
  }
  # Printing the output to stdout is very useful when debugging
  stdout { codec => rubydebug }
}

Once this environment is up (docker-compose up -d), we have a complete backend that can receive, process, and store frontend logs.
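
Before wiring up the frontend, it can help to confirm the pipeline end to end. The following is a minimal sketch and not part of the original setup: `dailyIndexName` and `smokeTest` are hypothetical helpers, and the code assumes Node 18+ (for the global fetch) and the ports from the docker-compose.yml above. It posts one sample event to Logstash and then looks for it in the daily index that the `frontend-logs-%{+YYYY.MM.dd}` pattern would produce.

```typescript
// Hypothetical smoke test; endpoints assume the docker-compose.yml above.
const LOGSTASH_URL = 'http://localhost:5000';
const ELASTICSEARCH_URL = 'http://localhost:9200';

// Mirrors the Logstash index pattern "frontend-logs-%{+YYYY.MM.dd}".
// Logstash formats the date in UTC, so UTC getters are used here.
function dailyIndexName(date: Date): string {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(date.getUTCDate()).padStart(2, '0');
  return `frontend-logs-${yyyy}.${mm}.${dd}`;
}

// Posts one sample event, waits briefly, then searches today's index for it.
async function smokeTest(waitMs = 2000): Promise<boolean> {
  const correlationId = `smoke-${Date.now()}`;
  await fetch(LOGSTASH_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ correlationId, source: 'smoke-test' }),
  });

  // Give Logstash and Elasticsearch a moment to process and index the event.
  await new Promise(resolve => setTimeout(resolve, waitMs));

  const q = encodeURIComponent(`correlationId:"${correlationId}"`);
  const res = await fetch(
    `${ELASTICSEARCH_URL}/${dailyIndexName(new Date())}/_search?q=${q}`,
  );
  if (!res.ok) return false; // e.g. 404 when the index does not exist yet
  const body = (await res.json()) as { hits: { hits: unknown[] } };
  return body.hits.hits.length > 0;
}
```

Calling `smokeTest()` after the containers are up should resolve to true once Logstash has finished starting, which can take some time on first boot.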

Step 2: Define Observability Behavior with Gherkin

Now we describe the logging behavior we expect in natural language. We create a features/observability.feature file.

Feature: Frontend Application Observability

  Scenario: A successful GraphQL query should be logged with context
    Given the user is on the "Product" page
    And the GraphQL API will respond with success for the "GetProduct" query
    When the component mounts and executes the "GetProduct" query
    Then a structured log for the "GetProduct" operation should exist in Elasticsearch
    And the log should have a status of "success"
    And the log should contain the correct "correlationId"

  Scenario: A GraphQL error should be logged with detailed error information
    Given the user is on the "Product" page
    And the GraphQL API will respond with an error for the "GetProduct" query
    When the component mounts and executes the "GetProduct" query
    Then a structured log for the "GetProduct" operation should exist in Elasticsearch
    And the log should have a status of "error"
    And the log should contain the correct "correlationId"
    And the log should contain the GraphQL error message "Product not found"

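These scenarios clearly define what "done" means: not merely that the code runs, but that the system's telemetry behaves as expected. The correlationId is the key piece here; it is the unique identifier that ties a frontend operation to the backend logs.

Step 3: Core Implementation - A Custom Apollo Link

ApolloLink is at the heart of the implementation. We create an ObservabilityLink that sits in the request chain and is responsible for generating context, capturing the operation result, and sending the log.

This Link must be robust: its own failures should never affect the normal business flow.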
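
One way to enforce that rule is to run all telemetry work behind a small guard. This is a sketch under assumptions, not the article's implementation: `safely` is a hypothetical helper name. It covers a case that a `.catch()` on `fetch` alone would miss, namely synchronous exceptions such as `JSON.stringify` throwing on circular data while the request body is being built.

```typescript
// Hypothetical guard: runs a telemetry callback and never lets it throw.
function safely(
  task: () => void,
  onFailure: (error: unknown) => void = () => {},
): void {
  try {
    task();
  } catch (error) {
    try {
      onFailure(error);
    } catch {
      // The failure handler must not break the caller either.
    }
  }
}

// Example: serializing circular data throws synchronously, before any
// promise exists for a .catch() to handle.
const circular: Record<string, unknown> = {};
circular.self = circular;

let captured: unknown = null;
safely(
  () => { JSON.stringify(circular); },
  error => { captured = error; },
);
// `captured` now holds the TypeError raised by JSON.stringify,
// and execution continues normally.
```

Inside a Link, the call site could look like `safely(() => this.sendLog(payload))`, so that `observer.next(result)` runs regardless of what the logging code does.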

src/apollo/ObservabilityLink.ts:

import { ApolloLink, Operation, NextLink, FetchResult } from '@apollo/client';
import { Observable } from '@apollo/client/utilities';
import { v4 as uuidv4 } from 'uuid';

// Shape of the log payload
interface LogPayload {
  timestamp: string;
  correlationId: string;
  operationName: string;
  status: 'success' | 'error';
  source: 'frontend';
  context: {
    userId?: string; // example context
    sessionId?: string;
  };
  request: {
    variables: Record<string, any>;
  };
  response?: {
    data?: Record<string, any>;
    errors?: any[];
  };
  durationMs: number;
}

// Log destination; in a real project this should come from configuration
const LOGSTASH_ENDPOINT = 'http://localhost:5000';

export class ObservabilityLink extends ApolloLink {
  
  private sendLog(payload: LogPayload) {
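    // In production, sendBeacon or fetch with the keepalive flag is more reliable,
    // but to keep the test environment simple we use a plain fetch here.
    // Error handling here is critical: a failed log send must never block the main application.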
    fetch(LOGSTASH_ENDPOINT, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(payload),
    }).catch(error => {
      // In production you might buffer failed logs in localStorage
      // or report them to a dedicated error-monitoring service
      console.error('Failed to send log to Logstash:', error);
    });
  }

  public request(operation: Operation, forward: NextLink): Observable<FetchResult> {
    const startTime = Date.now();
    const correlationId = uuidv4();
    
    // Inject the correlationId into the request headers so backend services can pick it up for end-to-end tracing
    operation.setContext(({ headers = {} }) => ({
      headers: {
        ...headers,
        'X-Correlation-ID': correlationId,
      },
    }));

    // Also store the correlationId in the operation context for later access
    operation.setContext({ correlationId });

    return new Observable<FetchResult>(observer => {
      const subscription = forward(operation).subscribe({
        next: result => {
          const durationMs = Date.now() - startTime;
          const { operationName, variables } = operation;

          const payload: LogPayload = {
            timestamp: new Date().toISOString(),
            correlationId,
            operationName,
            status: result.errors?.length ? 'error' : 'success', // an empty errors array is not a failure
            source: 'frontend',
            context: {
              // In a real application, read these from your auth state management
              userId: 'user-123', 
              sessionId: 'session-abc',
            },
            request: {
              variables,
            },
            response: {
              // To keep logs small, record data selectively or not at all
              // data: result.data, 
              errors: result.errors,
            },
            durationMs,
          };

          this.sendLog(payload);
          observer.next(result);
        },
        error: networkError => {
          // Handle network errors or other errors from the Link chain
          const durationMs = Date.now() - startTime;
          const { operationName, variables } = operation;

          const payload: LogPayload = {
            timestamp: new Date().toISOString(),
            correlationId,
            operationName,
            status: 'error',
            source: 'frontend',
            context: {
              userId: 'user-123',
              sessionId: 'session-abc',
            },
            request: {
              variables,
            },
            response: {
              errors: [{ message: networkError.message, stack: networkError.stack }],
            },
            durationMs,
          };
          
          this.sendLog(payload);
          observer.error(networkError);
        },
        complete: () => {
          observer.complete();
        },
      });

      return () => {
        if (subscription) {
          subscription.unsubscribe();
        }
      };
    });
  }
}

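This ObservabilityLink does several key things:

It generates a unique correlationId for every operation.
It injects the correlationId into the request headers via setContext, which is the basis for end-to-end tracing.
It subscribes to the Observable returned by forward(operation), so it captures the outcome whether the operation succeeds (next) or fails (error).
It builds a detailed, structured LogPayload object.
It sends the log to Logstash through a separate sendLog method that has its own error handling.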
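
The next and error handlers in the Link build nearly identical payloads. As a sketch of how that construction could be pulled into one pure function (which also makes it testable without Apollo), consider the following. `buildLogPayload`, `PayloadInput`, and the injectable `now` clock are hypothetical names introduced for illustration; the payload shape follows LogPayload above, minus the optional `data` field. One deliberate choice here: an empty `errors` array counts as success.

```typescript
interface LogPayload {
  timestamp: string;
  correlationId: string;
  operationName: string;
  status: 'success' | 'error';
  source: 'frontend';
  context: { userId?: string; sessionId?: string };
  request: { variables: Record<string, unknown> };
  response?: { errors?: ReadonlyArray<{ message: string; stack?: string }> };
  durationMs: number;
}

interface PayloadInput {
  correlationId: string;
  operationName: string;
  variables: Record<string, unknown>;
  startTime: number; // epoch milliseconds captured when the request started
  errors?: ReadonlyArray<{ message: string; stack?: string }>;
  context?: { userId?: string; sessionId?: string };
  now?: () => number; // injectable clock so tests are deterministic
}

function buildLogPayload(input: PayloadInput): LogPayload {
  const now = (input.now ?? Date.now)();
  const hasErrors = (input.errors?.length ?? 0) > 0;
  return {
    timestamp: new Date(now).toISOString(),
    correlationId: input.correlationId,
    operationName: input.operationName,
    status: hasErrors ? 'error' : 'success',
    source: 'frontend',
    context: input.context ?? {},
    request: { variables: input.variables },
    response: hasErrors ? { errors: input.errors } : undefined,
    durationMs: now - input.startTime,
  };
}

const example = buildLogPayload({
  correlationId: 'abc-123',
  operationName: 'GetProduct',
  variables: { id: 'product-1' },
  startTime: 1000,
  now: () => 1500,
});
// example.status is 'success' and example.durationMs is 500
```

Both handlers could then call this function with the fields they already have, passing `result.errors` in the next handler and a single wrapped network error in the error handler.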

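Step 4: Implement the BDD Step Definitions - Connecting Behavior to Code

Now we write the JavaScript/TypeScript code that maps the steps in the Gherkin scenarios to concrete actions and assertions. We use @cucumber/cucumber to define the steps, @testing-library/react to render and interact with components, and the @elastic/elasticsearch client to query the logs.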

features/step_definitions/observability_steps.js:

const { Given, When, Then, Before, After } = require('@cucumber/cucumber');
const { Client } = require('@elastic/elasticsearch');
const { render, screen, waitFor } = require('@testing-library/react');
const { MockedProvider } = require('@apollo/client/testing');
const { ApolloLink } = require('@apollo/client'); // used by the capture link in the When step
const React = require('react');
// Assume these are our component and Apollo Client configuration
const { ProductComponent } = require('../../src/components/ProductComponent');
const { createClient } = require('../../src/apollo/client');
const { GET_PRODUCT_QUERY } = require('../../src/graphql/queries');

// Initialize the Elasticsearch client before each scenario
let esClient;
Before(function () {
  esClient = new Client({ node: 'http://localhost:9200' });
});

// Clean up after each scenario
After(function () {
  // Indices created during the test could be cleaned up here; omitted for this demo
});

let mocks = [];
let correlationId;

Given('the user is on the "Product" page', function () {
  // Setup step only; nothing to do for now
});

Given('the GraphQL API will respond with success for the "GetProduct" query', function () {
  mocks = [
    {
      request: {
        query: GET_PRODUCT_QUERY,
        variables: { id: 'product-1' },
      },
      result: {
        data: {
          product: { id: 'product-1', name: 'Awesome Widget', price: 99.99 },
        },
      },
    },
  ];
});

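Given('the GraphQL API will respond with an error for the "GetProduct" query', function () {
  // Note: `error` simulates a network-level failure, which ObservabilityLink logs
  // through its error handler. To simulate a GraphQL-level error instead, return
  // `result: { errors: [new GraphQLError('Product not found')] }`.
  mocks = [
    {
      request: {
        query: GET_PRODUCT_QUERY,
        variables: { id: 'product-1' },
      },
      error: new Error('Product not found'),
    },
  ];
});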

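When('the component mounts and executes the "GetProduct" query', async function () {
  // Required locally so that this step is self-contained.
  const { ApolloLink } = require('@apollo/client');
  const { MockLink } = require('@apollo/client/testing');
  const { ObservabilityLink } = require('../../src/apollo/ObservabilityLink');

  // Test-only link that captures the correlationId which ObservabilityLink
  // stored in the operation context. Real applications do not need this.
  const captureLink = new ApolloLink((operation, forward) => {
    correlationId = operation.getContext().correlationId;
    this.correlationId = correlationId; // keep it on the Cucumber World for the Then steps
    return forward(operation);
  });

  // MockedProvider has no `client` prop; it builds its own client from `link`.
  // Chaining the real ObservabilityLink in front of a MockLink lets the mocked
  // responses flow through the logging code.
  const link = ApolloLink.from([
    new ObservabilityLink(),
    captureLink,
    new MockLink(mocks, false),
  ]);

  render(
    <MockedProvider link={link} addTypename={false}>
      <ProductComponent id="product-1" />
    </MockedProvider>
  );

  // Crude wait for the query to resolve and the log to be sent asynchronously
  await new Promise(resolve => setTimeout(resolve, 500));
});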

// A robust query helper with retries
async function findLogByCorrelationId(correlationId, retries = 5, delay = 300) {
  for (let i = 0; i < retries; i++) {
    try {
      const result = await esClient.search({
        index: 'frontend-logs-*', // search all matching indices
        body: {
          query: {
            match: {
              "correlationId": correlationId
            }
          }
        }
      });

      if (result.hits.hits.length > 0) {
        return result.hits.hits[0]._source;
      }
    } catch (err) {
      console.error(`Elasticsearch query failed (attempt ${i + 1}):`, err.message);
    }
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  return null;
}


Then('a structured log for the {string} operation should exist in Elasticsearch', async function (operationName) {
  const log = await findLogByCorrelationId(this.correlationId);
  if (!log) {
    throw new Error(`Log with correlationId ${this.correlationId} not found in Elasticsearch.`);
  }
  if (log.operationName !== operationName) {
    throw new Error(`Expected operation name to be ${operationName}, but got ${log.operationName}`);
  }
  this.log = log; // keep the log for assertions in later steps
});

Then('the log should have a status of {string}', function (status) {
  if (this.log.status !== status) {
    throw new Error(`Expected log status to be ${status}, but got ${this.log.status}`);
  }
});

Then('the log should contain the correct "correlationId"', function () {
  if (this.log.correlationId !== this.correlationId) {
    throw new Error(`Log correlationId does not match the one from the request.`);
  }
});

Then('the log should contain the GraphQL error message {string}', function (errorMessage) {
  const foundError = this.log.response.errors.some(e => e.message === errorMessage);
  if (!foundError) {
    throw new Error(`Expected to find error message "${errorMessage}" in the log.`);
  }
});

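The essence of this step definition file is in the Then steps. They use the Elasticsearch client to verify the final outcome of the behavior, rather than simply trusting that the code "should" have sent a log. The retry logic in findLogByCorrelationId is essential: it absorbs the delay between a log being sent and it being indexed by Elasticsearch, which makes the tests more stable.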
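
One detail worth considering in findLogByCorrelationId: with Elasticsearch's default dynamic mapping, a string field is indexed as analyzed `text` plus a `keyword` sub-field, and a `match` query on the analyzed field can match on individual tokens of the UUID rather than on the whole value. A sketch of an exact-match alternative follows. `buildCorrelationSearch` is a hypothetical helper, and the `correlationId.keyword` field name assumes that default mapping (no custom index template).

```typescript
interface CorrelationSearch {
  index: string;
  size: number;
  query: { term: { [field: string]: string } };
}

// Builds search parameters that match the correlationId exactly.
function buildCorrelationSearch(
  correlationId: string,
  indexPattern = 'frontend-logs-*',
): CorrelationSearch {
  return {
    index: indexPattern,
    size: 1, // only the first hit is needed
    query: { term: { 'correlationId.keyword': correlationId } },
  };
}

const params = buildCorrelationSearch('3f2b8c1e-0000-4000-8000-000000000001');
```

With the v8 @elastic/elasticsearch client, these parameters could be passed directly as `esClient.search(params)` inside the retry loop.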

Step 5: Integration and Verification

Finally, we need to update the Apollo Client initialization code to add ObservabilityLink to the Link chain.

src/apollo/client.js:

import { ApolloClient, InMemoryCache, HttpLink, from } from '@apollo/client';
import { ObservabilityLink } from './ObservabilityLink';

export const createClient = (testLink = null) => {
  const httpLink = new HttpLink({ uri: '/graphql' });
  const observabilityLink = new ObservabilityLink();

  // `from` chains links in order: requests flow left to right, responses right to left
  const linkChain = testLink 
    ? from([observabilityLink, testLink, httpLink])
    : from([observabilityLink, httpLink]);

  return new ApolloClient({
    link: linkChain,
    cache: new InMemoryCache(),
  });
};

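When we run cucumber-js, it executes the scenarios defined in the .feature file, calls the matching step definitions, drives our React component, triggers ObservabilityLink, and finally verifies in Elasticsearch that the log was produced.

The whole flow can be summarized in the following diagram: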

sequenceDiagram
    participant BDD Runner as BDD<br/>Test Runner
    participant ReactComponent as React Component
    participant ApolloClient as Apollo Client<br/>(with ObsLink)
    participant Logstash
    participant Elasticsearch

    BDD Runner->>+ReactComponent: Mounts component (When step)
    ReactComponent->>+ApolloClient: Executes GraphQL query
    ApolloClient->>ApolloClient: ObservabilityLink generates<br/>correlationId
    ApolloClient->>Logstash: Sends structured log (async)
    Logstash->>+Elasticsearch: Processes and indexes log
    Elasticsearch-->>-Logstash: Indexing confirmation
    ApolloClient-->>-ReactComponent: Returns data/error
    ReactComponent-->>-BDD Runner: UI update/state change

    %% Verification Phase
    BDD Runner->>+Elasticsearch: Queries for log using<br/>correlationId (Then step)
    Elasticsearch-->>-BDD Runner: Returns found log document
    BDD Runner->>BDD Runner: Asserts log content is correct

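This closed loop ensures that our observability infrastructure is not only built but also continuously protected by automated tests. Any accidental breakage of ObservabilityLink, or any unintended change to the log format, immediately fails the BDD tests.

Limitations and Future Iterations

This approach is powerful, but in a real production environment there are several points that need trade-offs and tuning.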

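Performance overhead: Serializing and sending logs in ObservabilityLink puts a small amount of pressure on the event loop. For performance-sensitive applications, consider moving the log-sending logic into a Web Worker, or using the navigator.sendBeacon API, which is designed to reliably send small amounts of data before the page unloads without blocking the main thread.
Log sampling: In a high-traffic application, logging every single GraphQL operation is unrealistic and creates significant cost and noise. ObservabilityLink needs to become smarter. It could fetch dynamic sampling rules from remote configuration (for example, via another GraphQL query or a configuration service), such as logging only 1% of successful requests but 100% of failed ones, or enabling detailed logging only for specific operationName values.
Test environment fragility: End-to-end tests that depend on a Docker container network can become flaky because of network jitter, slow container startup, and similar issues. The core logic of ObservabilityLink should also be covered by unit tests that use a tool such as Jest to mock the forward function and verify that the log payload is built correctly, without starting the whole ELK stack. The end-to-end BDD tests then serve as a higher-level integration check.
Richer context: The context in the current example (userId, sessionId) is hard-coded. A real project needs a clean mechanism for injecting application context (such as auth state, feature-flag state, and version number) into ObservabilityLink, which may require integration with a state management library (such as Redux or MobX).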
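
The sampling idea above can be sketched as a small pure function. This is an illustration under assumptions rather than a finished design: `SamplingRules` and `shouldLog` are hypothetical names, the rules are passed in directly instead of being fetched from remote configuration, and the random source is injectable so the decision can be tested deterministically.

```typescript
type LogStatus = 'success' | 'error';

interface SamplingRules {
  successRate: number; // fraction of successful operations to log, 0 to 1
  errorRate: number;   // fraction of failed operations to log, 0 to 1
  alwaysLogOperations?: string[]; // operation names that bypass sampling
}

function shouldLog(
  status: LogStatus,
  operationName: string,
  rules: SamplingRules,
  random: () => number = Math.random,
): boolean {
  if (rules.alwaysLogOperations?.includes(operationName)) {
    return true;
  }
  const rate = status === 'error' ? rules.errorRate : rules.successRate;
  // random() is in [0, 1), so a rate of 1 always logs and a rate of 0 never does.
  return random() < rate;
}

// Log 1% of successes, 100% of failures, and every "Checkout" operation.
const rules: SamplingRules = {
  successRate: 0.01,
  errorRate: 1,
  alwaysLogOperations: ['Checkout'],
};
```

In ObservabilityLink, the check would wrap the send, for example `if (shouldLog(payload.status, payload.operationName, rules)) this.sendLog(payload);`, so that sampled-out operations skip serialization and the network call entirely.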
