Load-balancing & Scaling
Since v4.5.1, the Compute Server exposes metrics that describe its current load, which allows for a scalable setup.
The /ready endpoint reports whether the Compute Server instance is able to process a request that would occupy a certain amount of resources (specified via READYNESS_REQUIRED_FREE_CU_COUNT; this is set to a reasonable value by default, so you normally won’t need to change it). If the readiness probe fails, the load balancer should hand the request to a different instance.
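The routing rule above can be sketched as follows. This is a minimal illustration, not the behaviour of any particular load balancer: the helper names `is_ready` and `pick_instance` are hypothetical, and it assumes that a failed readiness probe shows up as a non-2xx status or a connection error, which the text above does not specify.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout_s: float = 1.0) -> bool:
    """Return True if GET <base_url>/ready answers with a 2xx status.

    Assumption: a failing probe is signalled by a non-2xx status or a
    connection error. Adjust this to the actual response of your deployment.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx), refused connections and timeouts all end up here.
        return False


def pick_instance(
    instances: Iterable[str],
    probe: Callable[[str], bool] = is_ready,
) -> Optional[str]:
    """Return the first instance whose readiness probe succeeds, else None."""
    for base_url in instances:
        if probe(base_url):
            return base_url
    return None
```

Passing the probe as a parameter keeps the selection logic testable without a running instance. In most deployments the load balancer or orchestrator performs this check itself.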
Occasionally a request may still be rejected, for example due to a race condition between multiple incoming requests. For this case the web SDK features retry logic that can be configured via a few variables. See: https://reactivereality.gitlab.io/sdks/pictofitcore-web/classes/ComputeServer.html#RETRY_BACKOFF_MULTIPLIER_MS
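To illustrate the general idea of retrying a rejected request with a growing wait, here is a generic sketch. It is not the web SDK's implementation: the exception `RequestRejected`, the function `call_with_retry`, its parameters, and the linear backoff formula are all assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RequestRejected(Exception):
    """Hypothetical error raised when an instance rejects a request."""


def call_with_retry(
    send: Callable[[], T],
    max_retries: int = 3,
    backoff_ms: float = 200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on RequestRejected, wait and try again.

    The wait grows linearly with the attempt number (backoff_ms * attempt).
    This formula is an assumption chosen for the sketch.
    """
    attempt = 0
    while True:
        try:
            return send()
        except RequestRejected:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and surface the last rejection
            sleep(backoff_ms * attempt / 1000.0)
```

A growing wait gives the instances time to free resources before the next attempt, instead of hitting them again immediately.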
The Compute Server exports a number of metrics in the Prometheus format under its metrics endpoint.
One of the exposed metrics, rr_pictofit_computeserver_cu_reserved_percent, describes the current load of the Compute Server as a value between 0 and 1, which can be used to determine the average load over a time period and decide when to spin up more instances.
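As a rough sketch of how this metric could be read from a scrape, the function below pulls one sample out of Prometheus text exposition output. The function name `read_metric` is hypothetical, and in practice a Prometheus server or client library would usually do this parsing for you.

```python
from typing import Optional

METRIC = "rr_pictofit_computeserver_cu_reserved_percent"


def read_metric(exposition: str, name: str = METRIC) -> Optional[float]:
    """Return the first sample value of `name` from Prometheus text output."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith(name):
            continue
        rest = line[len(name):]
        if rest.startswith("{"):
            # Drop the label set, e.g. {instance="a"}.
            rest = rest[rest.rfind("}") + 1:]
        elif not rest[:1].isspace():
            continue  # a different metric that merely shares the prefix
        fields = rest.split()
        if fields:
            return float(fields[0])  # an optional timestamp may follow
    return None
```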
We recommend averaging this value over a time period and scaling the number of instances in relation to that average.
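One possible way to turn the averaged load into an instance count is the proportional rule also used by the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * average / target). The sketch below assumes that rule. The function name, the target load of 0.5, and the instance bounds are illustrative assumptions, not recommended values.

```python
import math
from typing import Sequence


def desired_instances(
    load_samples: Sequence[float],
    current_instances: int,
    target_load: float = 0.5,   # assumed target for the average load (0..1)
    min_instances: int = 1,     # assumed lower bound
    max_instances: int = 10,    # assumed upper bound
) -> int:
    """Suggest an instance count from load samples collected over a time window."""
    if not load_samples:
        return current_instances  # no data: leave the deployment unchanged
    average = sum(load_samples) / len(load_samples)
    # Proportional scaling: keep the average load per instance near the target.
    wanted = math.ceil(current_instances * average / target_load)
    return max(min_instances, min(max_instances, wanted))
```

For example, two instances averaging a load of 0.75 against a target of 0.5 yield a suggestion of three instances.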
The Compute Server is resource-aware and will not process any request for which it does not currently have the required resources.
For a single instance we recommend 4 CPUs and 8 GB of RAM. Depending on your typical workload you may tune these values, but it is generally not recommended to go much lower, so that the instance can handle at least a few heavy requests at once.
Not setting any resource limits will eventually lead to the container using up all memory on the system, which can cause an out-of-memory (OOM) condition. It is therefore strongly recommended to set the expected resource requirements for your deployment.