There are two modes for running a dataflow in SDF: ephemeral and worker. In ephemeral mode, the CLI runs the dataflow in the same process as the CLI itself, and the dataflow is terminated when you exit the CLI. This is useful for developing dataflows as well as testing packages.
A worker is a process that runs dataflows continuously. Unlike ephemeral mode, a worker is a long-running process: when you exit the CLI, the dataflow does not terminate. Workers are primarily used for running dataflows in production, but they can also be used in development when there is no need to test the package.
A worker can run anywhere as long as it can connect to the Fluvio cluster: in a data center, in the cloud, or on an edge device. There is no limit on the number of workers, and each worker can run multiple dataflows. It is recommended to run a single worker per machine.
SDF can target different workers by selecting a worker profile. The worker profile is associated with the Fluvio profile: when you switch the Fluvio profile, SDF automatically switches the worker profile. Once you have selected a worker, the same worker is used for all dataflow operations until you select a different one.
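For example, assuming you also have a Fluvio profile named cloud (a hypothetical name used only for illustration), switching the Fluvio profile changes which worker subsequent SDF commands target:
$> fluvio profile switch cloud
$> sdf worker list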
By default, SDF runs the dataflow in the worker. If you are using InfinyOn Cloud, the worker is provisioned and automatically registered in the profile.
Using Host Worker for Local Cluster
If you are using a local cluster, you need to either create a Host worker or register a Remote worker. The easiest option is to create a Host worker, which runs on the same machine as the CLI.
To create a host worker, use the following command:
$> sdf worker create <name>
This spawns the worker process in the background. It runs until you shut down the worker or the machine is rebooted. The name can be anything as long as it is unique on your machine, since profiles are not shared across machines.
After creating a worker, you can list the workers:
$> sdf worker create main
Worker `main` created for cluster: `local`
$> sdf worker list
NAME TYPE CLUSTER WORKER ID
* main Host local 7fd7eda3-2738-41ef-8edc-9f04e500b919
The * indicates the currently selected worker. The worker id is an internal unique identifier for the current Fluvio cluster; unless specified, it is generated by the system.
SDF only supports running a single Host worker per machine, since a single worker can support many dataflows. If you try to create another Host worker, you will get an error message.
$> sdf worker create main2
Starting worker: main2
There is already a host worker with pid 20686 running. Please terminate it first
Shutting down a worker terminates all running dataflows and the worker process.
$> sdf worker shutdown main
Shutting down pid: 20688
Shutting down pid: 20686
Host worker: main has been shutdown
Even though the host worker is shut down and removed from the profile, the dataflow files and state are still persisted. You can restart the worker and the dataflows will resume.
For example, if you have the dataflows fraud-detector and car-processor running in the worker and you shut down the worker, the dataflow processes will be terminated. You can resume them by recreating the Host worker:
$> sdf worker create main
The local worker stores the dataflow state in the local file system. The dataflow state is stored in ~/.sdf/<cluster>/worker/<dataflow>. So for the local cluster, files will be stored in ~/.sdf/local/worker/dataflows.
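For example, you can inspect the persisted state directly. The dataflow names below are the fraud-detector and car-processor examples from above, and the exact directory layout may differ:
$> ls ~/.sdf/local/worker/dataflows
car-processor  fraud-detector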
If you delete the Fluvio cluster, the worker needs to be manually shut down and created again. This limitation will be removed in a future release.
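For example, after deleting and recreating a local cluster, you would reset the host worker with something like the following sketch (fluvio cluster delete and fluvio cluster start are standard Fluvio commands; the worker name main is from the example above):
$> sdf worker shutdown main
$> fluvio cluster delete
$> fluvio cluster start
$> sdf worker create main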
Remote Worker
A remote worker is a worker that runs on a different machine from the CLI. It is typically set up by a DevOps team for production environments.
The typical lifecycle for using a remote worker is (a compact command sketch follows this list):
- Start the remote worker on the server.
- Register the worker under a name on your machine.
- Run the dataflow in the remote worker.
- Unregister the worker when it is no longer needed. This doesn't shut down the worker; it only removes it from your profile.
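As a sketch of that lifecycle, using the commands covered in the rest of this section (the names edge2 and edg2-worker-id are illustrative):
# on the server
$> sdf worker launch --base-dir /sdf --worker-id edg2-worker-id
# on your machine
$> sdf worker register edge2 edg2-worker-id
$> sdf worker switch edge2
# ... run dataflows ...
$> sdf worker unregister edge2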
Note that there are many ways to manage a remote worker. You can use Kubernetes, Docker, systemd, Terraform, Ansible, or any other tool that can manage the server process and ensure it restarts when the server is rebooted. A minimal systemd sketch is shown below.
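For illustration only, a minimal systemd unit that keeps a remote worker running might look like this; the unit name, binary path, and user are assumptions, and the launch command is described in the last section of this page:
# /etc/systemd/system/sdf-worker.service (hypothetical)
[Unit]
Description=SDF remote worker
After=network-online.target

[Service]
ExecStart=/usr/local/bin/sdf worker launch --base-dir /sdf --worker-id edg2-worker-id
Restart=on-failure
User=sdf

[Install]
WantedBy=multi-user.target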
InfinyOn Cloud is a good example of a remote worker. When you create a cluster in InfinyOn Cloud, it automatically provisions the worker for you and syncs the profile when the cluster is created.
Once you know there are remote workers, you can discover them using the sdf worker list -all command.
$> sdf worker list -all
NAME TYPE CLUSTER WORKER ID
* main Host local 7fd7eda3-2738-41ef-8edc-9f04e500b919
N/A Remote local edg2-worker-id
This shows a host worker on your local machine and a remote worker with id edg2-worker-id running somewhere. To register the remote worker, use the register command:
$> sdf worker register <name> <worker-id>
For example, registering the remote worker with the name edge2 and worker id edg2-worker-id:
$> sdf worker register edge2 edg2-worker-id
Worker `edge2` is registered for cluster: `local`
You can switch among workers using the switch command:
$> sdf worker switch <worker_profile>
To unregister the worker when you are done with it and no longer need it, use the unregister command:
$> sdf worker unregister <name>
Listing and switching the worker
To list all known workers, use the list command:
$> sdf worker list
NAME TYPE CLUSTER WORKER ID
* main Host local 7fd7eda3-2738-41ef-8edc-9f04e500b919
edge2 Remote local edg2-worker-id
To switch the worker, use the switch command:
$> sdf worker switch <worker-name>
This assumes the worker name has already been created or registered.
Managing dataflow in worker
When you are running a dataflow in a worker, the name of the worker is shown in the prompt:
$> sdf run
[main] >> show state
Listing and selecting dataflow
To list all dataflows running in the worker, you can use the dataflow list command:
$> sdf dataflow list
[jolly-pond]>> show dataflow
Dataflow Status
wordcount-window-simple running
* user-job-map running
[jolly-pond]>>
Other commands like show state require an active dataflow. If there is no active dataflow, an error message is shown.
[jolly-pond]>> show state
No dataflow selected. Run `select dataflow`
[jolly-pond]>>
To select the dataflow, use the select dataflow command:
[jolly-pond]>> select dataflow wordcount-window-simple
dataflow switched to: wordcount-window-simple
Deleting dataflow
To delete a dataflow, use the dataflow delete command:
$> sdf dataflow delete user-job-map
[jolly-pond]>> show dataflow
Dataflow Status
wordcount-window-simple running
Note that since user-job-map is deleted, it is no longer listed in the dataflow list.
Using worker in InfinyOn Cloud
With InfinyOn Cloud, there is no need to manage the worker. The cloud provisions the worker for you and syncs the profile when the cluster is created.
For example, creating a cloud cluster will automatically provision a worker and create the SDF worker profile:
$> fluvio cloud login --use-oauth2
$> fluvio cloud cluster create
Creating cluster...
Done!
Downloading cluster config
Registered sdf worker: jellyfish
Switched to new profile: jellyfish
You can unregister the cloud worker like any other remote worker.
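For example, to unregister the cloud worker registered above:
$> sdf worker unregister jellyfish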
Advanced: Starting remote worker
To start a worker as a remote worker, use the launch command:
$> sdf worker launch --base-dir <dir> --worker-id <worker-id>
where base-dir and worker-id are optional parameters. If you don't specify base-dir, the default directory /sdf is used. If you don't specify worker-id, a unique id is generated for you.
This command is typically used by a DevOps team to start the worker on the server.