[Workflow Orchestrator] Port workflow-core To Rust For f2 Distributed Pipeline Management

by ADMIN

Introduction

The f2 image processing pipeline requires a robust and efficient workflow orchestration engine. This article details the effort to port the proven workflow-core engine to Rust, providing a stable, performant, and distributed solution for managing complex data processing and business logic pipelines across our microservices. This foundational infrastructure will enable the entire image processing workflow, handling parallel processing branches, sync points, error recovery, and state persistence across distributed services. This article will cover the user story, context, technical details, acceptance criteria, and other crucial aspects of this project.

User Story

As an f2 system architect,
I need a stable, performant, distributed workflow engine implemented in Rust,
so that we can manage complex data processing and business logic pipelines across our microservices in a lightweight, no-nonsense way.

Context

The existing image processing pipeline at f2 requires the orchestration of multiple Rust microservices, including ArcFace, face detection, and EXIF extraction. These microservices have complex state management, error handling, and retry logic requirements. To meet these needs, we are porting danielgerlag's workflow-core engine (C#/.NET) to Rust. The resulting workflow orchestrator will provide the orchestration layer needed to manage these microservices effectively.

The orchestrator must handle parallel processing branches, synchronization points, error recovery, and state persistence across distributed services. The Rust implementation is intended to deliver the performance and stability that our high-throughput image processing tasks require. By porting workflow-core to Rust, we aim to create a foundational infrastructure component that supports the entire image processing workflow: managing dependencies between microservices, handling failures gracefully, and providing a clear view of the workflow execution state.

The workflow orchestrator is not just a technical component; it's a strategic enabler for f2. It allows us to build and manage complex data processing pipelines with confidence that the underlying infrastructure can handle the load. We expect the Rust-based orchestrator to compare favorably with other potential solutions in performance and resource utilization, which matters for scaling our image processing capabilities and supporting future growth. The orchestrator will also provide a centralized point for monitoring and managing workflows, making it easier to identify and resolve issues.

This project is a significant investment in f2's infrastructure, and it will pay dividends in the form of increased efficiency, reliability, and scalability. The Rust-based workflow orchestrator will be a key component in our ability to deliver high-quality image processing services. It will also enable us to innovate and develop new features more quickly, as we will have a solid foundation upon which to build. The workflow engine will be designed to be flexible and extensible, allowing us to adapt to changing requirements and new technologies. This adaptability is critical in the fast-paced world of image processing, where new algorithms and techniques are constantly emerging.

Target Service

New Service - Workflow Orchestrator

Technical Details

Dependencies

Core foundational service - no service dependencies initially.

gRPC Interfaces

  • WorkflowDefinition service (define/register workflows)
  • WorkflowExecution service (start/pause/resume/cancel workflows)
  • WorkflowState service (query status, get results)
  • StepExecution service (for microservices to report completion)

The gRPC interfaces are a critical part of the workflow orchestrator. They define how different services within the f2 ecosystem will interact with the orchestrator. The WorkflowDefinition service allows for the registration and definition of workflows, specifying the steps involved and their dependencies. This service is essential for configuring the workflow engine to execute specific tasks. The WorkflowExecution service provides the ability to start, pause, resume, and cancel workflows, offering control over the execution lifecycle. This is crucial for managing the orchestration process and handling unexpected situations. The WorkflowState service allows for querying the status of workflows and retrieving results, providing visibility into the execution progress. This service is essential for monitoring and debugging workflows.
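The lifecycle operations of the WorkflowExecution service (start, pause, resume, cancel) imply a small state machine per workflow instance. The following is a minimal sketch in plain Rust; the `InstanceStatus` and `Command` names and the exact set of legal transitions are assumptions made for illustration, not taken from workflow-core or from an existing f2 API.

```rust
/// Hypothetical lifecycle states for one workflow instance.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceStatus {
    Pending,
    Running,
    Paused,
    Cancelled,
    Completed,
}

/// Hypothetical commands mirroring the WorkflowExecution operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    Start,
    Pause,
    Resume,
    Cancel,
}

/// Returns the next status, or an error describing the illegal transition.
pub fn apply(status: InstanceStatus, cmd: Command) -> Result<InstanceStatus, String> {
    use Command::*;
    use InstanceStatus::*;
    match (status, cmd) {
        (Pending, Start) => Ok(Running),
        (Running, Pause) => Ok(Paused),
        (Paused, Resume) => Ok(Running),
        // Cancellation is accepted from any non-terminal state.
        (Pending | Running | Paused, Cancel) => Ok(Cancelled),
        (s, c) => Err(format!("cannot apply {:?} while {:?}", c, s)),
    }
}
```

Rejecting illegal transitions in one function keeps the rule set in a single place, so a gRPC handler would only need to map the error onto a status code.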

The StepExecution service is used by microservices to report the completion of individual steps within a workflow. This allows the orchestrator to track the progress of each step and trigger the next step in the sequence, so that steps run in the correct order and dependencies are met. The gRPC interfaces are designed for high-throughput communication between services, and they give services a standardized way to interact with the orchestrator, making it easier to integrate new services into the f2 ecosystem. gRPC provides an efficient, strongly typed transport; reliable delivery still depends on the retry and failure handling that the orchestrator applies on top of it.
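To illustrate how a completion report might advance a workflow, here is a small in-memory sketch. The `Instance` type and `report_completed` method are hypothetical names; a real handler would sit behind the StepExecution gRPC service and read its state from the database.

```rust
/// Hypothetical in-memory view of one running workflow instance.
pub struct Instance {
    /// Step names in the order the definition lists them.
    steps: Vec<String>,
    /// Index of the step currently awaiting a completion report.
    current: usize,
}

impl Instance {
    pub fn new(steps: &[&str]) -> Self {
        Instance {
            steps: steps.iter().map(|s| s.to_string()).collect(),
            current: 0,
        }
    }

    /// Handles a completion report. Returns the next step to dispatch,
    /// `Ok(None)` when the workflow has finished, or an error when the
    /// report does not match the step that is currently in flight.
    pub fn report_completed(&mut self, step: &str) -> Result<Option<String>, String> {
        match self.steps.get(self.current) {
            Some(expected) if expected.as_str() == step => {
                self.current += 1;
                Ok(self.steps.get(self.current).cloned())
            }
            Some(expected) => Err(format!("expected '{}', got '{}'", expected, step)),
            None => Err("workflow already finished".to_string()),
        }
    }
}
```

Rejecting reports that do not match the in-flight step is one way to guard against duplicate or out-of-order callbacks.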

The design of these gRPC interfaces is driven by the need for a flexible and scalable workflow orchestration system. They allow for complex workflows to be defined and executed, with the orchestrator managing the state and dependencies. The interfaces are also designed to be easy to use, with clear and concise messages that simplify integration with other services. This ease of use is essential for encouraging adoption of the orchestrator within the f2 ecosystem. The gRPC interfaces are a key enabler for the f2 image processing pipeline, allowing us to build and manage complex data processing workflows with confidence.

Database Tables

  • workflow_definitions (workflow templates)
  • workflow_instances (active/completed executions)
  • workflow_steps (individual step state)
  • workflow_events (audit log)

The database tables are the backbone of the workflow orchestrator, providing persistent storage for workflow definitions, instances, steps, and events. The workflow_definitions table stores the templates for workflows, defining the steps and their relationships. This table is essential for the orchestrator to understand how to execute a particular workflow. The workflow_instances table tracks active and completed executions, providing a record of each workflow run. This table is crucial for monitoring and managing workflows, as it allows us to track their progress and identify any issues. The workflow_steps table stores the state of individual steps within a workflow, including their execution status and any results. This table is critical for maintaining the integrity of the workflow execution and ensuring that steps are executed in the correct order.
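As a rough picture of how these three tables relate, the sketch below models their rows as Rust structs. All column names, the `Status` enum, and the `derive_instance_status` helper are assumptions for illustration; the actual schema is not specified in this article.

```rust
/// Hypothetical status values shared by instances and steps.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Running,
    Succeeded,
    Failed,
}

/// Row in workflow_definitions (a workflow template).
pub struct DefinitionRow {
    pub id: u64,
    pub name: String,
    pub version: u32,
}

/// Row in workflow_instances, pointing at the definition it was started from.
pub struct InstanceRow {
    pub id: u64,
    pub definition_id: u64,
    pub status: Status,
}

/// Row in workflow_steps, pointing at the instance it belongs to.
pub struct StepRow {
    pub instance_id: u64,
    pub name: String,
    pub status: Status,
}

/// Derives an instance status from its step rows: failed if any step failed,
/// succeeded only if every step succeeded, running once any step has left
/// the pending state, and pending otherwise.
pub fn derive_instance_status(steps: &[StepRow]) -> Status {
    if steps.iter().any(|s| s.status == Status::Failed) {
        Status::Failed
    } else if !steps.is_empty() && steps.iter().all(|s| s.status == Status::Succeeded) {
        Status::Succeeded
    } else if steps.iter().any(|s| s.status != Status::Pending) {
        Status::Running
    } else {
        Status::Pending
    }
}
```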

The workflow_events table serves as an audit log, recording significant events that occur during workflow execution. This table is essential for debugging and troubleshooting issues, as it provides a detailed history of what happened during a workflow run. The database schema is designed to be efficient and scalable, allowing the orchestrator to handle a large number of workflows and steps. The use of a relational database ensures that the data is consistent and reliable, which is critical for the integrity of the workflow orchestration process. The database is also designed to be flexible, allowing us to add new tables and columns as needed to support future features and requirements. This flexibility is essential for ensuring that the orchestrator can adapt to changing needs.
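The audit-log role of workflow_events can be sketched as an append-only list with a per-instance history query. The `EventLog` type, its field names, and the sequence-number scheme below are hypothetical and stand in for the real table.

```rust
/// Hypothetical row in workflow_events.
#[derive(Debug, Clone, PartialEq)]
pub struct EventRow {
    /// Monotonically increasing sequence number, starting at 1.
    pub seq: u64,
    pub instance_id: u64,
    pub message: String,
}

/// Append-only, in-memory stand-in for the workflow_events table.
#[derive(Default)]
pub struct EventLog {
    rows: Vec<EventRow>,
}

impl EventLog {
    /// Appends an event and returns its sequence number.
    pub fn append(&mut self, instance_id: u64, message: &str) -> u64 {
        let seq = self.rows.len() as u64 + 1;
        self.rows.push(EventRow {
            seq,
            instance_id,
            message: message.to_string(),
        });
        seq
    }

    /// Returns the events of one instance in the order they were recorded.
    pub fn history(&self, instance_id: u64) -> Vec<&EventRow> {
        self.rows
            .iter()
            .filter(|row| row.instance_id == instance_id)
            .collect()
    }
}
```

Because rows are never updated or removed, the history of a run can be replayed in order when debugging.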

The design of the database tables is driven by the need for a persistent and reliable storage layer for the workflow orchestrator. The tables are designed to capture all the necessary information about workflows, steps, and events, allowing us to monitor and manage them effectively. The database is also designed to be performant, ensuring that the orchestrator can quickly access and update the data it needs. This performance is critical for the overall efficiency of the image processing pipeline. The database schema is a key component of the workflow orchestrator, providing the foundation for its functionality and reliability.

Performance Requirements

  • Handle 1000+ concurrent workflow instances
  • Step execution latency < 50ms
  • Persistent state storage
  • Graceful degradation under load
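One possible reading of "graceful degradation under load" is admission control: cap the number of concurrently running instances and refuse (or queue) new ones once the cap is reached. The sketch below shows that idea with an atomic counter; the `AdmissionGate` type is hypothetical, and the limit is a parameter rather than a value taken from the requirements.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Caps the number of concurrently running workflow instances.
pub struct AdmissionGate {
    running: AtomicUsize,
    limit: usize,
}

impl AdmissionGate {
    pub fn new(limit: usize) -> Self {
        AdmissionGate {
            running: AtomicUsize::new(0),
            limit,
        }
    }

    /// Tries to reserve a slot. Returns false when the gate is full, so the
    /// caller can queue or reject the request instead of overloading workers.
    pub fn try_admit(&self) -> bool {
        self.running
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.limit {
                    Some(n + 1)
                } else {
                    None
                }
            })
            .is_ok()
    }

    /// Releases a slot. Must only be called once per successful `try_admit`.
    pub fn release(&self) {
        self.running.fetch_sub(1, Ordering::AcqRel);
    }

    /// Number of slots currently reserved.
    pub fn in_use(&self) -> usize {
        self.running.load(Ordering::Acquire)
    }
}
```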

Security Considerations

  • Service-to-service authentication
  • Workflow definition validation
  • Step execution authorization
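Workflow definition validation could, for example, reject definitions whose steps are missing, duplicated, or wired to steps that do not exist. The following sketch assumes a hypothetical `StepDef` shape (a name plus the names of follow-on steps) and leaves out cycle detection and authorization checks.

```rust
use std::collections::HashSet;

/// Hypothetical step declaration: a name plus the steps that follow it.
pub struct StepDef {
    pub name: String,
    pub next: Vec<String>,
}

impl StepDef {
    pub fn new(name: &str, next: &[&str]) -> Self {
        StepDef {
            name: name.to_string(),
            next: next.iter().map(|n| n.to_string()).collect(),
        }
    }
}

/// Rejects definitions that are empty, declare a step twice, or reference
/// a step that was never declared. Cycle detection is not part of this sketch.
pub fn validate(steps: &[StepDef]) -> Result<(), String> {
    if steps.is_empty() {
        return Err("definition has no steps".to_string());
    }
    let mut names = HashSet::new();
    for step in steps {
        if !names.insert(step.name.as_str()) {
            return Err(format!("duplicate step '{}'", step.name));
        }
    }
    for step in steps {
        for target in &step.next {
            if !names.contains(target.as_str()) {
                return Err(format!(
                    "step '{}' points to unknown step '{}'",
                    step.name, target
                ));
            }
        }
    }
    Ok(())
}
```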

Acceptance Criteria (Given/When/Then)

Scenario 1: Basic Workflow Execution

Given a workflow definition is registered with the orchestrator
When a workflow instance is started with input parameters
Then the orchestrator executes steps in the defined order
And each step's completion triggers the next step
And workflow state is persisted throughout execution
And final results are available via gRPC query
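Scenario 1 can be pictured as a loop that runs steps in definition order and records a snapshot after each one. In this sketch the `Step` alias and `run_sequential` function are hypothetical, steps are plain functions over an integer payload, and a `Vec` stands in for the persistence layer.

```rust
/// A named step that takes the running payload and returns the updated one.
pub type Step = (&'static str, fn(i64) -> i64);

/// Runs steps in order and records a (step name, payload) snapshot after
/// each one. The snapshots stand in for state persisted to the database.
pub fn run_sequential(steps: &[Step], input: i64) -> (i64, Vec<(String, i64)>) {
    let mut value = input;
    let mut persisted = Vec::new();
    for (name, step) in steps {
        // The next step only starts after the previous one has returned.
        value = step(value);
        persisted.push((name.to_string(), value));
    }
    (value, persisted)
}
```

Persisting after every step means a restarted orchestrator could resume from the last recorded snapshot rather than from the beginning.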

Scenario 2: Parallel Step Execution

Given a workflow definition with parallel branches (like EXIF + Face processing)
When the workflow reaches the parallel section
Then both branches execute simultaneously
And the workflow waits at the sync point for both branches
And execution continues only after both branches complete
And partial results from each branch are available independently
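The fork-and-join shape of Scenario 2 can be sketched with scoped threads from the standard library. The `run_parallel` helper is a hypothetical name, and in-process threads stand in for branches that would really run on separate microservices.

```rust
use std::thread;

/// Runs two branches concurrently and returns only after both have finished,
/// which models the sync point. Each branch result is returned separately.
pub fn run_parallel<A, B, RA, RB>(left: A, right: B) -> (RA, RB)
where
    A: FnOnce() -> RA + Send,
    B: FnOnce() -> RB + Send,
    RA: Send,
    RB: Send,
{
    thread::scope(|s| {
        let l = s.spawn(left);
        let r = s.spawn(right);
        // join() blocks until the branch finishes; if a branch panicked,
        // expect() re-raises that failure here.
        (
            l.join().expect("left branch panicked"),
            r.join().expect("right branch panicked"),
        )
    })
}
```

Returning the two results as separate values keeps each branch's output independently available once the sync point is passed.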

Scenario 3: Step Failure and Retry

Given a workflow step fails during execution
When the failure is reported to the orchestrator
Then the orchestrator applies the configured retry policy
And failed step is retried up to max retry limit
And workflow is marked as failed if max retries exceeded
And failure details are logged for debugging
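A retry policy of the kind Scenario 3 describes can be sketched as a loop around a fallible step. `RetryPolicy`, `StepOutcome`, and `run_with_retry` are hypothetical names; this sketch counts one initial attempt plus `max_retries` retries, collects failure details, and omits any delay or backoff between attempts.

```rust
/// Hypothetical retry policy: one initial attempt plus `max_retries` retries.
pub struct RetryPolicy {
    pub max_retries: u32,
}

#[derive(Debug, PartialEq)]
pub enum StepOutcome<T> {
    /// The step succeeded on the given attempt (1-based).
    Succeeded { value: T, attempts: u32 },
    /// Every attempt failed; one error message is kept per attempt.
    Failed { errors: Vec<String> },
}

/// Runs `step` until it succeeds or the retry limit is exhausted.
pub fn run_with_retry<T>(
    policy: &RetryPolicy,
    mut step: impl FnMut() -> Result<T, String>,
) -> StepOutcome<T> {
    let mut errors = Vec::new();
    for attempt in 1..=policy.max_retries + 1 {
        match step() {
            Ok(value) => return StepOutcome::Succeeded { value, attempts: attempt },
            // Keep the failure detail so it can be logged for debugging.
            Err(e) => errors.push(format!("attempt {}: {}", attempt, e)),
        }
    }
    StepOutcome::Failed { errors }
}
```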

Scenario 4: Workflow Cancellation

Given a workflow instance is currently executing
When a cancellation request is received
Then currently executing steps are allowed to complete
And no new steps are started
And workflow status is marked as cancelled
And cleanup actions are executed if defined
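Cooperative cancellation is one way to get the behavior in Scenario 4: the runner checks a flag only between steps, so a step that has started always finishes. The `CancelToken`, `RunResult`, and `run_cancellable` names below are hypothetical, and steps are modeled as in-process closures.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Shared flag that a cancellation request sets.
pub struct CancelToken(AtomicBool);

impl CancelToken {
    pub fn new() -> Self {
        CancelToken(AtomicBool::new(false))
    }
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
pub enum RunResult {
    Completed,
    Cancelled { steps_run: usize },
}

/// Runs steps in order. The token is checked before each step starts, never
/// during one, so an in-flight step completes and no new step is started.
/// A request that arrives during the final step leaves the run `Completed`.
pub fn run_cancellable(
    steps: &[&dyn Fn()],
    token: &CancelToken,
    cleanup: &dyn Fn(),
) -> RunResult {
    for (index, step) in steps.iter().enumerate() {
        if token.is_cancelled() {
            cleanup();
            return RunResult::Cancelled { steps_run: index };
        }
        step();
    }
    RunResult::Completed
}
```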

Scenario 5: Distributed Step Execution

Given workflow steps that execute on different microservices
When a step is assigned to a microservice
Then the orchestrator sends execution request via gRPC
And waits for completion callback from the microservice
And handles service unavailability with appropriate retries
And tracks execution progress across service boundaries
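Waiting for a completion callback across service boundaries usually requires some form of correlation between the request that was sent and the callback that comes back. The sketch below tracks in-flight steps by a correlation id; the `PendingSteps` type and its methods are hypothetical, and the gRPC transport and the retry handling are left out.

```rust
use std::collections::HashMap;

/// Tracks steps dispatched to remote services until their callback arrives.
#[derive(Default)]
pub struct PendingSteps {
    next_id: u64,
    /// Correlation id mapped to the name of the service handling the step.
    in_flight: HashMap<u64, String>,
}

impl PendingSteps {
    /// Records a dispatch and returns the correlation id that the service
    /// is expected to echo back in its completion callback.
    pub fn dispatch(&mut self, service: &str) -> u64 {
        self.next_id += 1;
        self.in_flight.insert(self.next_id, service.to_string());
        self.next_id
    }

    /// Handles a completion callback and returns the service that finished.
    /// Unknown or already-completed ids are rejected.
    pub fn complete(&mut self, id: u64) -> Result<String, String> {
        self.in_flight
            .remove(&id)
            .ok_or_else(|| format!("no in-flight step with id {}", id))
    }

    /// Number of steps still waiting for a callback.
    pub fn outstanding(&self) -> usize {
        self.in_flight.len()
    }
}
```

Steps that stay in the map past a deadline are natural candidates for the retry handling described in Scenario 3.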

Technical Acceptance Criteria

  • [ ] Code follows f2 Rust style guidelines
  • [ ] gRPC service definitions are documented
  • [ ] Unit tests cover all scenarios above
  • [ ] Integration tests verify service communication
  • [ ] Error handling includes proper logging
  • [ ] Performance meets f2 requirements (no Python bottlenecks)
  • [ ] Security considerations addressed
  • [ ] Documentation updated

The technical acceptance criteria are essential for ensuring that the workflow orchestrator meets the required standards of quality and performance. Adhering to the f2 Rust style guidelines ensures that the code is consistent and maintainable. Documenting the gRPC service definitions is crucial for making the orchestrator accessible and understandable to other services. Comprehensive unit tests are necessary to validate the functionality of the orchestrator and ensure that it behaves as expected. Integration tests are required to verify the communication between the orchestrator and other services, ensuring that they work together seamlessly.

Proper error handling with detailed logging is crucial for debugging and troubleshooting issues. The orchestrator must meet the performance requirements of f2, including handling a high number of concurrent workflows and maintaining low step execution latency. Addressing security considerations is paramount, including service-to-service authentication, workflow definition validation, and step execution authorization. Finally, updated documentation is essential for making the orchestrator usable and understandable to developers and operators. These technical acceptance criteria are a key part of the development process, ensuring that the workflow orchestrator is a robust and reliable component of the f2 infrastructure.

These criteria cover a wide range of aspects, from code quality and documentation to performance and security. By meeting these criteria, we can ensure that the workflow orchestrator is a valuable asset for f2. The technical acceptance criteria also serve as a checklist for the development team, ensuring that all critical aspects of the orchestrator are addressed before it is deployed to production. This thoroughness is essential for minimizing the risk of issues and ensuring the smooth operation of the image processing pipeline.

Definition of Done

  • [ ] All acceptance criteria pass
  • [ ] Code reviewed and approved
  • [ ] Tests passing in CI
  • [ ] Documentation complete
  • [ ] Deployed to staging environment
  • [ ] Integration with workflow orchestrator verified

The definition of done provides a clear and concise set of criteria that must be met before the workflow orchestrator project can be considered complete. Passing all acceptance criteria is the primary requirement, ensuring that the orchestrator meets the specified functionality and performance standards. Code review and approval are essential for ensuring code quality and maintainability. Passing tests in continuous integration (CI) provides automated validation of the orchestrator's functionality. Complete documentation is crucial for making the orchestrator usable and understandable to others.

Deployment to a staging environment allows for thorough testing and validation of the orchestrator in a production-like setting. The final checklist item, integration with the workflow orchestrator, is self-referential for this service; we read it as verifying that the microservices that depend on the orchestrator integrate with it correctly. These criteria provide a clear and objective measure of when the project is complete, reducing ambiguity and ensuring that all critical aspects are addressed. The definition of done is a key tool for project management, helping to keep the project on track and ensuring that it delivers the expected value.

Meeting these criteria ensures that the workflow orchestrator is not only functional but also well-integrated into the f2 ecosystem and ready for production use. The definition of done serves as a final checklist, ensuring that nothing is overlooked before the project is marked as complete. This thoroughness is essential for minimizing the risk of issues and ensuring the smooth operation of the image processing pipeline.

Dependencies

Blocks: All image processing pipeline issues, database schema setup, microservice implementations
Blocked by: None (foundational infrastructure)
Related: Database schema design, gRPC service definitions

Complexity

XL - Epic-level work, needs breakdown (2+ weeks)

Priority

Critical - Blocks other work or addresses security issue

Additional Notes

This is the foundational piece that enables the entire f2 architecture. Consider breaking this down into smaller stories:

  1. Core workflow engine (state machine, step execution)
  2. gRPC API layer (external interface)
  3. Database persistence layer
  4. Retry and error handling
  5. Distributed execution coordination
  6. Monitoring and observability

Reference implementation: https://github.com/danielgerlag/workflow-core

Key design principles:

  • Lightweight and performant
  • Clear separation of concerns
  • Robust error handling
  • Observable and debuggable
  • Easy integration with microservices

f2 Community Guidelines

  • [ ] I agree to follow f2's community guidelines and ethics
  • [ ] This work aligns with f2's mission to protect marginalized communities

Conclusion

The port of workflow-core to Rust for the f2 distributed pipeline management is a critical undertaking. It will provide a stable, performant, and scalable solution for orchestrating complex workflows across microservices. This article has outlined the user story, technical details, acceptance criteria, and other essential aspects of this project. By adhering to these guidelines and principles, we can ensure the successful implementation of the workflow orchestrator and its contribution to f2's mission.