-
-
Notifications
You must be signed in to change notification settings - Fork 787
fix: otel logs better DynamicFlushScheduler #2318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The changes introduce a more flexible and adaptive dynamic flush scheduler to address production issues where the system wasn't flushing data fast enough, causing memory growth and crashes. This issue arises from the existing scheduler handling only a single flush at a time, limiting concurrency and failing to cope with the influx of logs. - Added configuration options for setting minimum and maximum concurrency levels, maximum batch size, and memory pressure threshold. These parameters ensure that flush operations adjust dynamically based on workload and pressure. - Implemented `pLimit` to facilitate concurrent flush operations, with adjustments made according to batch queue length and memory pressure. - Metrics reporting improvements were added to monitor the dynamic behavior of the flush scheduler, aiding in identifying performance issues and optimizing the operation accordingly.
This change introduces load shedding mechanisms to manage TaskEvent records, particularly those of kind LOG, when the system experiences high volumes and is unable to flush to the database in a timely manner. The addition aims to prevent overwhelming the system and ensure critical tasks are prioritized. - Added configuration options for `loadSheddingThreshold` and `loadSheddingEnabled` in multiple modules to activate load shedding. - Introduced `isDroppableEvent` function to allow specific events to be dropped when load shedding is enabled. - Ensured metrics are updated to reflect dropped events and load shedding status, providing visibility into system performance during high load conditions. - Updated loggers to inform about load shedding state changes, ensuring timely awareness of load management activities.
The 'queuePressure' variable was being used without being defined in the DynamicFlushScheduler class, causing potential runtime errors. This commit adds the missing definition and ensures that the variable is correctly calculated based on the 'totalQueuedItems' and 'memoryPressureThreshold'. - Addressed code inconsistencies and improved formatting. - Defined 'queuePressure' in the 'adjustConcurrency' method to prevent potential undefined errors. - Enhanced readability by maintaining consistent spacing and format across the file, contributing to the stability and maintainability of the code. - Adjusted batch size logic based on the newly defined 'queuePressure' variable.
The concurrency adjustment logic in the dynamic flush scheduler has been refactored to improve clarity and maintainability. This change moves the calculation of pressure metrics outside of the conditional blocks to ensure they are always determined prior to decision-making. - The queue pressure and time since last flush calculations were moved up in the code to be independent of the 'backOff' condition. - This refactor sets up the groundwork for more reliable concurrency scaling and better performance monitoring capabilities. The overall logic of adjusting concurrency based on system pressure metrics remains unchanged. This adjustment addresses ongoing issues with the scheduler that were not resolved by previous changes.
|
WalkthroughThis change introduces new environment variable schema entries related to event processing concurrency, batch size, memory pressure, and load shedding controls. The Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
apps/webapp/app/v3/eventRepository.server.ts (1)
221-231
: Consider making droppable event types configurable.The implementation correctly passes configuration to the scheduler. However, the
isDroppableEvent
callback is hardcoded to only dropLOG
events. Consider making this configurable through environment variables or configuration to allow flexibility in load shedding strategies.apps/webapp/app/v3/dynamicFlushScheduler.server.ts (2)
62-75
: Add validation for concurrency bounds.While the initialization logic is sound, consider adding validation to ensure
minConcurrency <= maxConcurrency
to prevent configuration errors.// Initialize dynamic scaling parameters this.minConcurrency = config.minConcurrency ?? 1; this.maxConcurrency = config.maxConcurrency ?? 10; +if (this.minConcurrency > this.maxConcurrency) { + throw new Error(`minConcurrency (${this.minConcurrency}) cannot be greater than maxConcurrency (${this.maxConcurrency})`); +} this.maxBatchSize = config.maxBatchSize ?? config.batchSize * 5;
283-305
: Consider making metrics reporting interval configurable.The 30-second interval for metrics reporting is hardcoded. For high-volume systems, this could be too frequent, while for debugging it might not be frequent enough.
Consider adding a configuration option for the metrics reporting interval:
+ metricsReportingInterval?: number; // milliseconds // In constructor: +const metricsInterval = config.metricsReportingInterval ?? 30000; // In startMetricsReporter: - }, 30000); + }, metricsInterval);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
references/hello-world/src/trigger/example.ts
is excluded by!references/**
📒 Files selected for processing (3)
apps/webapp/app/env.server.ts
(1 hunks)apps/webapp/app/v3/dynamicFlushScheduler.server.ts
(2 hunks)apps/webapp/app/v3/eventRepository.server.ts
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.github/copilot-instructions.md)
**/*.{ts,tsx}
: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
apps/webapp/app/env.server.ts
apps/webapp/app/v3/dynamicFlushScheduler.server.ts
apps/webapp/app/v3/eventRepository.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.github/copilot-instructions.md)
We use zod a lot in packages/core and in the webapp
Files:
apps/webapp/app/env.server.ts
apps/webapp/app/v3/dynamicFlushScheduler.server.ts
apps/webapp/app/v3/eventRepository.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit Inference Engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}
: In the webapp, all environment variables must be accessed through theenv
export ofenv.server.ts
, instead of directly accessingprocess.env
.
When importing from@trigger.dev/core
in the webapp, never import from the root@trigger.dev/core
path; always use one of the subpath exports as defined in the package's package.json.
Files:
apps/webapp/app/env.server.ts
apps/webapp/app/v3/dynamicFlushScheduler.server.ts
apps/webapp/app/v3/eventRepository.server.ts
🧠 Learnings (2)
apps/webapp/app/env.server.ts (3)
Learnt from: CR
PR: triggerdotdev/trigger.dev#0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-07-18T17:49:47.180Z
Learning: Applies to apps/webapp/**/*.{ts,tsx} : In the webapp, all environment variables must be accessed through the env
export of env.server.ts
, instead of directly accessing process.env
.
Learnt from: CR
PR: triggerdotdev/trigger.dev#0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-07-18T17:50:25.014Z
Learning: Applies to trigger.config.ts : Global lifecycle hooks, telemetry, runtime, machine settings, log level, max duration, and build configuration must be set in trigger.config.ts
as shown.
Learnt from: CR
PR: triggerdotdev/trigger.dev#0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-07-18T17:49:47.180Z
Learning: Applies to apps/webapp/app/**/*.test.{ts,tsx} : Test files in the webapp should not import env.server.ts
, either directly or indirectly. Tests should only import classes and functions from files matching app/**/*.ts
of the webapp, and those files should not use environment variables directly; everything should be passed through as options instead.
apps/webapp/app/v3/eventRepository.server.ts (1)
Learnt from: CR
PR: triggerdotdev/trigger.dev#0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-07-18T17:50:25.014Z
Learning: Applies to /trigger//*.{ts,tsx,js,jsx} : When using retry, queue, machine, or maxDuration options, configure them as shown in the examples for Trigger.dev tasks.
🧬 Code Graph Analysis (1)
apps/webapp/app/v3/eventRepository.server.ts (3)
apps/webapp/app/env.server.ts (1)
env
(1030-1030)internal-packages/tracing/src/index.ts (1)
Gauge
(23-23)apps/webapp/app/metrics.server.ts (1)
metricsRegister
(5-5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 10)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 10)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (10, 10)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 10)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 10)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (9, 10)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 10)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 10)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 10)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 10)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (11)
apps/webapp/app/env.server.ts (1)
255-260
: LGTM! Environment variables follow existing patterns.The new environment variables for event processing configuration are well-structured and consistent with the existing schema:
- Proper use of
coerce.number().int()
for numeric values- Sensible defaults that allow gradual scaling
- Consistent naming convention with
EVENTS_
prefix- Boolean flag using string pattern matches other flags in the file
apps/webapp/app/v3/eventRepository.server.ts (4)
110-116
: New configuration parameters properly integrated.The optional parameters for concurrency control and load shedding are well-typed and follow TypeScript best practices.
208-211
: Good addition of scheduler status getter.The getter provides clean access to internal scheduler metrics for monitoring purposes.
1347-1352
: Environment variables correctly integrated.The new environment variables are properly read and passed to the EventRepository constructor.
1372-1432
: Well-structured Prometheus metrics for scheduler monitoring.The metrics follow the existing pattern and provide comprehensive visibility into the flush scheduler's behavior. The use of the
collect()
function to query current status is the correct approach for Gauge metrics.apps/webapp/app/v3/dynamicFlushScheduler.server.ts (6)
1-3
: LGTM! Good choice of concurrency library.The addition of
p-limit
for managing concurrent operations is appropriate for this use case.
9-17
: Well-structured configuration options.The new configuration parameters are properly typed and documented. The optional parameters with sensible defaults provide flexibility while maintaining backward compatibility.
84-118
: Excellent load shedding implementation.The load shedding logic includes:
- Proper threshold checking
- Event tracking by kind for observability
- Hysteresis to prevent oscillation (deactivates at 80% of threshold)
- Clear logging of shedding events
158-226
: Well-designed concurrent batch flushing.The implementation correctly:
- Dequeues batches up to the concurrency limit
- Handles errors by re-queuing failed batches
- Tracks consecutive failures for backoff
- Uses Promise.allSettled for proper concurrent execution
- Recursively checks for more batches after completion
307-334
: Clean implementation of load shedding and event kind extraction.The helper methods are well-implemented:
- Load shedding respects the droppable event predicate
- Safe type checking for event kind extraction
- Proper separation of kept and dropped items
358-373
: Robust graceful shutdown implementation.The shutdown method properly:
- Clears the flush timer
- Flushes remaining items in the current batch
- Waits for all pending operations to complete
- Uses a polling approach to ensure completion
Description
This pull request enhances the dynamic flush scheduler with the following key changes:
pLimit
, dynamically adjusting concurrency based on queue length and memory pressure.queuePressure
variable in the scheduler to prevent runtime errors and improve batch size logic.