
Improve workflow perfs #18376

Merged
charlesBochet merged 4 commits into main from tt-improve-workflow-perfs
Mar 4, 2026

Conversation

@thomtrp
Contributor

@thomtrp commented Mar 4, 2026

Workflow crons take a few minutes to run. Loading each repo takes ~200–300ms locally; adding a lite mode brings that under 100ms. Also batching promises.

Finally, cleaning runs times out when there are too many, so deletions are batched as well.

@thomtrp force-pushed the tt-improve-workflow-perfs branch from 8bb9879 to 1d88ad3 on March 4, 2026 13:12
@thomtrp marked this pull request as ready for review on March 4, 2026 13:19
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR improves workflow cron job performance by introducing a lite workspace context (skipping feature flags, permissions, indexes and RLS) that cuts workspace loading time from ~200–300ms to <100ms, and by parallelising per-workspace checks with Promise.allSettled in batches of 50. It also refactors workflow run cleanup to use batched raw SQL DELETE … RETURNING loops instead of a full ORM-delete-per-row approach, and eliminates a redundant getWorkflowRunOrFail DB call in the iterator action.

Key changes:

  • loadLiteWorkspaceContext added to GlobalWorkspaceOrmManager to skip expensive metadata loading (feature flags, permissions, indexes, RLS predicates).
  • Three cron jobs (WorkflowCleanWorkflowRunsCronJob, WorkflowHandleStaledRunsCronJob, WorkflowRunEnqueueCronJob) now process workspaces in parallel batches of 50 with Promise.allSettled.
  • WorkflowCleanWorkflowRunsJob replaces bulk ORM deletes with two batched raw-SQL loops in deleteOldRuns and deleteExcessRunsPerWorkflow.
  • skipAndFailSafelyStepsThenContinue is promoted to public and batches step-info updates into a single DB write.
  • updateWorkflowRunStepInfos now performs a per-step deep merge (preserving existing fields like result when only status is updated) rather than shallow replacement.
  • Iterator action removes the redundant second getWorkflowRunOrFail call; test mocks should include explicit call-count assertions to prevent regressions.
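The batched parallelisation described above can be sketched as follows — a minimal illustration, assuming a hypothetical per-item handler; names are illustrative, not the PR's actual code:

```typescript
// Minimal sketch of processing items in fixed-size batches with
// Promise.allSettled, as the cron jobs above do with workspaces.
const BATCH_SIZE = 50;

async function processInBatches<T>(
  items: T[],
  batchSize: number,
  handler: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // allSettled so one failing item does not abort the whole batch
    await Promise.allSettled(batch.map(handler));
  }
}

// Usage sketch with hypothetical workspace ids:
const workspaceIds = ["w1", "w2", "w3"];
void processInBatches(workspaceIds, BATCH_SIZE, async (id) => {
  console.log(`checked ${id}`);
});
```

Promise.allSettled (rather than Promise.all) matters here: a rejection in one workspace check is recorded instead of short-circuiting the remaining workspaces in the batch.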

Confidence Score: 4/5

  • Safe to merge; performance improvements are well-scoped and logic changes are localised to cron/cleanup paths.
  • The lite-context optimisation is sound for read-only count queries. Iterator test should add explicit toHaveBeenCalledTimes(1) assertions to prevent regressions, but the current test setup will still detect actual behavioral changes. All batching and parallelization logic is correct and performance gains are substantial.
  • Iterator action test should include explicit call-count assertions for getWorkflowRunOrFail in the reset-loop and completion test cases.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cron Job fires] --> B[Load all active workspaces]
    B --> C{Batch workspaces<br/>in groups of 50}
    C --> D[Promise.allSettled per batch]
    D --> E{checkAndEnqueue<br/>per workspace}
    E -->|lite context| F[loadLiteWorkspaceContext<br/>skip flags/perms/indexes]
    F --> G{hasRuns/hasStaledRuns<br/>hasNotStartedRuns?}
    G -->|yes| H[Enqueue job]
    G -->|no| I[Skip]
    H --> J[WorkflowCleanWorkflowRunsJob]
    J --> K[deleteOldRuns loop<br/>DELETE … RETURNING, BATCH=200]
    K -->|deletedCount > 0| K
    K -->|done| L[deleteExcessRunsPerWorkflow loop<br/>CTE + DELETE … RETURNING, BATCH=200]
    L -->|deletedCount > 0| L
    L -->|done| M[Log total deleted]

Last reviewed commit: 4a7995a

`,
);

deletedCount = result[0].length;
Contributor


Crash bug: result[0].length always causes TypeError or incorrect batching

dataSource.query(DELETE ... RETURNING id) returns an array of row objects, e.g. [{id: 'uuid1'}, {id: 'uuid2'}, ...]. Therefore:

  • result[0] is the first row object {id: 'uuid1'}, not the rows array
  • result[0].length is undefined on a row object — making deletedCount = undefined and the while (undefined > 0) check always false

This has two consequences:

  1. The batch loop always exits after one iteration, so only the first 200 rows are ever deleted per job run
  2. When there are no rows to delete (e.g. when only deleteExcessRunsPerWorkflow has work but deleteOldRuns does not), result is [], result[0] is undefined, and accessing undefined.length throws a TypeError, crashing the job

The same issue exists at the equivalent line in deleteExcessRunsPerWorkflow (line 129).

The fix is to use result.length in both places:

Suggested change
deletedCount = result[0].length;
deletedCount = result.length;

Contributor Author


TypeORM's dataSource.query() returns a tuple [rows, affectedCount]
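The shape the author describes can be illustrated without a database — a simulated result, assuming the Postgres driver's raw `[rows, affectedCount]` tuple for `DELETE … RETURNING`; the values below are made up, not a real query result:

```typescript
// Simulated raw result of a Postgres DELETE ... RETURNING id via
// TypeORM's dataSource.query(): a two-element tuple [rows, affectedCount].
type DeleteReturningRaw = [Array<{ id: string }>, number];

const result: DeleteReturningRaw = [[{ id: "uuid1" }, { id: "uuid2" }], 2];

const [rows, affectedCount] = result;
const deletedCount = result[0].length; // counts returned rows, as in the PR

console.log(deletedCount === rows.length && deletedCount === affectedCount);
```

Under this shape, `result[0].length` is a valid row count; the bot's objection would only hold if `query()` returned the rows array directly.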

await workflowRunRepository.delete(workflowRunToDelete.id);
}
deletedCount = result[0].length;
totalDeleted += deletedCount;
Contributor


Same result[0].length bug in deleteExcessRunsPerWorkflow

Same issue as in deleteOldRuns above — result is the array of returned row objects from TypeORM, so result[0] is a single row object {id: '...'} with no .length. The batch loop will always exit after the first iteration.

Suggested change
deletedCount = result[0].length;
deletedCount = result.length;

Contributor Author


see answer #18376 (comment)

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/twenty-server/src/modules/workflow/workflow-runner/workflow-run-queue/jobs/workflow-clean-workflow-runs.job.ts">

<violation number="1" location="packages/twenty-server/src/modules/workflow/workflow-runner/workflow-run-queue/jobs/workflow-clean-workflow-runs.job.ts:88">
P2: `dataSource.query` results are row arrays; `result[0].length` is undefined and breaks the batching logic. Count rows with `result.length` instead.</violation>
</file>

<file name="packages/twenty-server/src/modules/workflow/workflow-runner/jobs/run-workflow.job.ts">

<violation number="1" location="packages/twenty-server/src/modules/workflow/workflow-runner/jobs/run-workflow.job.ts:165">
P1: The new `hasStepsToSkipOrFailSafely` check uses `isDefined` instead of array length, so empty skip/fail lists are treated as work to do and can prevent terminal workflow runs from being marked COMPLETED.</violation>
</file>
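The P1 violation above can be demonstrated in isolation — a sketch with hypothetical helper and variable names, not the PR's actual code:

```typescript
// An empty array is still "defined", so a definedness check treats an
// empty skip/fail list as pending work, whereas a length check does not.
function isDefined<T>(value: T | null | undefined): value is T {
  return value !== null && value !== undefined;
}

const stepsToSkipOrFailSafely: string[] = []; // no work to do

const wrongCheck = isDefined(stepsToSkipOrFailSafely); // true even when empty
const correctCheck = stepsToSkipOrFailSafely.length > 0; // false: no work

console.log(wrongCheck, correctCheck);
```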

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@thomtrp
Contributor Author

thomtrp commented Mar 4, 2026

@greptile-ai

@thomtrp force-pushed the tt-improve-workflow-perfs branch from 9412aed to 4a7995a on March 4, 2026 14:18
@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

packages/twenty-server/src/modules/workflow/workflow-executor/workflow-actions/iterator/__tests__/iterator-action.workflow-action.spec.ts, line 208
The PR removes one of the two mockResolvedValueOnce setups for getWorkflowRunOrFail to reflect the optimization (one fewer DB call), but the test lacks an explicit call count assertion. If a regression re-introduces the second call, the test will silently pass — the extra call would return undefined, potentially causing subtle failures rather than a clear assertion error.

Consider adding an explicit assertion to lock in this behavior:

      const result = await service.execute(input);

      expect(
        workflowRunWorkspaceService.getWorkflowRunOrFail,
      ).toHaveBeenCalledTimes(1);

      expect(result).toEqual({

Context Used: Rule from dashboard - Check that mocked functions in tests are called exactly once with the expected arguments.

let deletedCount: number;

do {
const result = await this.dataSource.query(
Member


If you really can't avoid raw queries, can we at least write it like this

const result = await this.dataSource.query(
  `
    DELETE FROM ${schemaName}."workflowRun"
    WHERE id IN (
      SELECT id FROM ${schemaName}."workflowRun"
      WHERE status IN ($1, $2)
        AND "createdAt" < NOW() - MAKE_INTERVAL(days => $3)
      LIMIT $4
    )
    RETURNING id;
  `,
  [
    WorkflowRunStatus.COMPLETED,
    WorkflowRunStatus.FAILED,
    RUNS_TO_CLEAN_THRESHOLD_DAYS,
    batchSize,
  ],
);

@@ -81,6 +87,21 @@ export class WorkflowRunEnqueueCronJob {
);
}

private async checkAndEnqueue(workspaceId: string): Promise<boolean> {
const hasNotStartedRuns = await this.hasNotStartedRuns(workspaceId);
Member


I'm wondering if we actually want to run executeInWorkspaceContext 50 times concurrently (as we know this is quite heavy). Or, if we really want to do that, I'd start with something lower like 10 🤔

authContext,
flatObjectMetadataMaps,
flatFieldMetadataMaps,
flatIndexMaps: {
Member


note: Ideally those maps should throw if they are accessed by the ORM in "lite" mode. We should only allow this new mode in controlled paths where we know a limited subset of the cache is accessed; otherwise some things could fail silently with empty maps

Contributor Author


Asked AI quickly and it looks a bit complicated. Let me know if you want us to dive into this

Member


> Asked AI quickly and it looks a bit complicated. Let me know if you want us to dive into this

Ah, @charlesBochet already merged it. Ok @thomtrp, I'll show you what I had in mind tomorrow. 👍
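One speculative way the guard discussed above could look (a sketch of the idea, not the reviewer's actual plan — names are illustrative): wrap the lite-mode placeholder maps in a Proxy that throws on any property access, so code paths needing the full cache fail loudly instead of silently reading empty maps.

```typescript
// Wrap a lite-mode placeholder map so any read throws instead of
// silently returning undefined from an empty map.
function failLoudlyInLiteMode<T extends object>(mapName: string, target: T): T {
  return new Proxy(target, {
    get() {
      throw new Error(`${mapName} is not loaded in lite workspace context`);
    },
  });
}

const flatIndexMaps = failLoudlyInLiteMode(
  "flatIndexMaps",
  {} as Record<string, object>,
);

let message = "";
try {
  // Any read should throw in lite mode
  void (flatIndexMaps as Record<string, object>)["someIndex"];
} catch (error) {
  message = (error as Error).message;
}

console.log(message);
```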

@charlesBochet merged commit 911a46a into main on Mar 4, 2026
66 checks passed
@charlesBochet deleted the tt-improve-workflow-perfs branch on March 4, 2026 17:29
@twenty-eng-sync

Hey @thomtrp! After you've done the QA of your Pull Request, you can mark it as done here. Thank you!
