fix(run-engine,webapp): abort unsealed batches so blocked parents resume#4016
fix(run-engine,webapp): abort unsealed batches so blocked parents resume#4016matt-aitken wants to merge 2 commits into
Conversation
batchTriggerAndWait blocks the parent on a waitpoint as soon as the batch is created, but the batch is only sealed once all its items finish streaming. If the item stream fails completely (rate limit, request timeout, crash), the batch stayed PENDING with no runs and the parent waited on its waitpoint forever. A seal-timeout reaper, scheduled when the batch is created, now aborts a batch that is still unsealed after BATCH_SEAL_TIMEOUT_MS (default 30 minutes) and completes the parent's waitpoint with an error so it resumes. The reaper is idempotent and no-ops if the stream sealed the batch in the meantime.
|
WalkthroughThis change adds a seal-timeout reaper for 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Don't abort a batch whose items were all processed (processingCompletedAt set) before the seal landed; those runs resume the parent on their own. Keep expireBatch idempotent: an already-aborted batch still resolves its parent waitpoint, so a retry after a mid-run crash can't leave the parent blocked. Treat ABORTED as terminal in tryCompleteBatch so a straggler run finalizing can't flip the batch back to COMPLETED. Abort the batch if scheduling the reaper fails, so a blocked parent is never stranded with nothing to free it.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts (2)
952-952: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winHard-coded delay could cause flakiness in CI.
The
setTimeout(500)waits for the run to be enqueued before dequeueing. In slow CI environments or under load, 500ms might not be enough, causing intermittent failures.Consider polling with
vi.waitForto check that the run is available for dequeue, or querying the queue length/state explicitly.
1032-1032: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winHard-coded delay could cause flakiness in CI.
The
setTimeout(1500)waits for the debounced completion job to process. In slow CI environments, the worker might not pick up the job within 1500ms, or conversely, the job might complete much faster, making tests unnecessarily slow.Consider using
vi.waitForto poll the batch state or worker queue depth, ensuring the job has been processed before asserting.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 23cf1c90-ff16-4b1a-a532-59c840e912dc
📒 Files selected for processing (3)
apps/webapp/app/runEngine/services/createBatch.server.tsinternal-packages/run-engine/src/engine/systems/batchSystem.tsinternal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- apps/webapp/app/runEngine/services/createBatch.server.ts
- internal-packages/run-engine/src/engine/systems/batchSystem.ts
📜 Review details
⏰ Context from checks skipped due to timeout. (25)
- GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
- GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
- GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
- GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
- GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects insteadImport from
@trigger.dev/sdkwhen writing Trigger.dev tasks. Never use@trigger.dev/sdk/v3or deprecatedclient.defineJob
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamicimport()when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only frompackages/core(@trigger.dev/core), never import from the root
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.{test,spec}.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use vitest for all tests in the Trigger.dev repository
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
internal-packages/run-engine/src/engine/tests/**/*.test.ts
📄 CodeRabbit inference engine (internal-packages/run-engine/CLAUDE.md)
Implement tests for RunEngine in
src/engine/tests/using testcontainers for Redis and PostgreSQL containerization
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.test.{ts,tsx}: Never mock anything in tests - use testcontainers instead
Test files should be placed next to source files (e.g.,MyService.ts->MyService.test.ts)
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.{js,ts,tsx,jsx,css,json,md}
📄 CodeRabbit inference engine (AGENTS.md)
Use Prettier for code formatting and run
pnpm run formatbefore committing
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
**/*.test.{js,ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.test.{js,ts,tsx}: Test files should live beside the files under test and use descriptivedescribeanditblocks
Use vitest for unit testing
Tests should avoid mocks or stubs and use helpers from@internal/testcontainerswhen Redis or Postgres are needed
Files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
🧠 Learnings (10)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-06-13T19:53:13.759Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3937
File: packages/trigger-sdk/skills/realtime-and-frontend/SKILL.md:258-260
Timestamp: 2026-06-13T19:53:13.759Z
Learning: When reviewing code that uses `trigger.dev/react-hooks`’s `useRealtimeRun`, preserve the call signature where the first argument is the full realtime handle object (not `handle.id`). This is intentional to maintain type-safety and is consistent with the official docs; do not suggest changing the first argument from the handle object to `handle.id`.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-06-17T17:13:49.929Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3948
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.bulk-actions.$bulkActionParam/route.tsx:48-62
Timestamp: 2026-06-17T17:13:49.929Z
Learning: In triggerdotdev/trigger.dev, within `dashboardLoader`/`dashboardAction` (or similar context resolver code) whenever you resolve an organization ID from an organization slug for RBAC/enterprise authorization scope, always read from the primary Prisma client (`prisma`), not `$replica`. Using `$replica` can hit replica-lag and cause the RBAC lookup/authorization to run without the correct org scope (bypassing intended role enforcement). Implement the slug→org lookup with `prisma.organization.findFirst(...)` (or equivalent primary-client query) and add an inline comment documenting why the primary client is required (replica lag could lead to unscoped RBAC checks).
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
📚 Learning: 2026-06-16T09:19:47.637Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3960
File: apps/webapp/test/prismaInfrastructureErrorCapture.test.ts:0-0
Timestamp: 2026-06-16T09:19:47.637Z
Learning: In this repo’s Vitest setup, `vitest.config.ts` uses `globals: true`, so identifiers like `vi`, `describe`, `it`, and `expect` are available as globals in Vitest test files. During code review, do not flag missing `vi`/`describe`/`it`/`expect` imports as a runtime error or correctness issue when they’re used in `*.test.ts/tsx` or `*.spec.ts/tsx` files. Explicit imports are still preferred for consistency, but they’re not required for runtime behavior.
Applied to files:
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts
🔇 Additional comments (2)
internal-packages/run-engine/src/engine/tests/batchTwoPhase.test.ts (2)
16-62: LGTM!
870-905: LGTM!
Summary
batchTriggerAndWaitcould leave the parent run suspended forever. When the item stream that populates a batch failed to complete after many retries (for example under rate limiting, a request timeout, or a crashed request), the batch was left with no runs and the parent kept waiting on a waitpoint that nothing would ever complete.Fix
The parent is blocked on the batch's waitpoint as soon as the batch is created, but the batch is only sealed once all its items finish streaming. There was no recovery path if streaming never reached the seal.
A seal-timeout reaper is now scheduled when the batch is created. If the batch is still unsealed after
BATCH_SEAL_TIMEOUT_MS(default 30 minutes), it is marked aborted and the parent's waitpoint is completed with an error, sobatchTriggerAndWaitrejects in the parent instead of hanging. The reaper is idempotent and race-safe: if the stream seals the batch first, the reaper no-ops.The default timeout is sized above the SDK's worst-case stream-retry budget (max attempts x server request timeout).