Self-Hosting Hardening
Minimum Hardening Baseline
- Terminate TLS in front of the API.
- Run the service as a non-root user.
- Restrict inbound access to required ports only.
- Isolate renderer sidecars from unnecessary network paths.
That baseline is the starting point, not the finish line. A self-hosted scraper talks to untrusted public pages and can sit close to valuable internal systems, so it deserves the same discipline as any other internet-facing API.
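As one illustration of the non-root item, a service could refuse to start when it finds itself running as root. This is a minimal sketch, not part of the product: the function names `running_as_root` and `refuse_root` are hypothetical, and the check only applies on POSIX systems where `os.geteuid` exists.

```python
import os
import sys


def running_as_root(euid: int) -> bool:
    """Return True when the effective user ID is root (0)."""
    return euid == 0


def refuse_root() -> None:
    """Exit early if the process was started as root (POSIX only)."""
    # os.geteuid is not available on Windows, so look it up defensively.
    geteuid = getattr(os, "geteuid", None)
    if geteuid is not None and running_as_root(geteuid()):
        sys.exit("refusing to start as root; run under a dedicated service user")
```

Calling `refuse_root()` at the top of an entrypoint script would turn a misconfigured deployment into an immediate, visible failure rather than a silent privilege problem.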
Network and Access Control
- Put a reverse proxy or gateway in front of the service.
- Restrict who can reach the API by network, identity, or both.
- Avoid exposing internal health or admin surfaces to the public internet.
- If browser rendering is enabled, isolate the renderer from internal systems it does not need to reach.
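Restricting access by network is normally enforced at the proxy, gateway, or firewall, but the underlying check is simple enough to sketch. The example below assumes a hypothetical allowlist (`ALLOWED_NETWORKS`) and a hypothetical helper (`client_allowed`); the CIDR ranges are placeholders, not recommended values.

```python
import ipaddress

# Hypothetical allowlist: replace with the networks that should reach the API.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def client_allowed(client_ip: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True if client_ip falls inside any allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable addresses are rejected
    return any(addr in net for net in networks)
```

If the service sits behind a reverse proxy, the address checked here should be the one the proxy vouches for, not a client-supplied header taken at face value.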
Runtime Isolation
Treat page fetching and browser rendering as higher-risk components than your application logic:
- run them with the least privilege possible,
- keep filesystem access narrow,
- and isolate sidecars so a renderer problem does not automatically become a broader platform problem.
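One way to keep a renderer problem contained is to refuse fetches whose destination resolves to an internal address. The sketch below is an application-level guard under stated assumptions: `is_internal_address` is a hypothetical helper, it expects an already-resolved IP address rather than a hostname, and it complements network-level isolation rather than replacing it.

```python
import ipaddress


def is_internal_address(ip: str) -> bool:
    """Return True for addresses a public-page fetcher should not reach."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local  # includes 169.254.0.0/16 metadata-style ranges
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )
```

A fetcher using this guard would need to resolve the hostname first and check every returned address, since a public-looking name can point at an internal IP.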
Secrets and Keys
- Keep API keys, proxy credentials, and LLM keys out of image builds.
- Inject secrets at runtime through your platform's secret store.
- Rotate keys during environment changes or incident response, not only on a fixed calendar.
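Runtime injection usually means the process reads secrets from its environment (or a mounted file) at startup. The following is a minimal sketch assuming environment variables; the variable names in `REQUIRED_SECRETS` and the `load_secrets` helper are hypothetical and should be replaced with whatever your secret store provides.

```python
import os

# Hypothetical variable names; use whatever your secret store injects.
REQUIRED_SECRETS = ("SCRAPER_API_KEY", "PROXY_PASSWORD", "LLM_API_KEY")


def load_secrets(env=None, required=REQUIRED_SECRETS) -> dict:
    """Read required secrets from the environment and fail fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report names only, never values.
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Failing at startup when a secret is absent makes a botched rotation obvious immediately, and nothing in this pattern requires the values to exist at image build time.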
Operational Guidance
- Rotate API keys during deployment cutovers.
- Keep browser-rendering dependencies on the smallest possible surface area.
- Expose /health only where your load balancer or monitoring needs it.
- Review warning-heavy targets separately; they often indicate anti-bot defenses rather than renderer bugs.
Monitoring and Auditability
At minimum, watch:
- API error rate,
- warning frequency,
- crawl job duration,
- renderer availability,
- and resource spikes on the browser sidecar.
Keep enough logs to answer three questions after an incident:
- what URL or workload triggered the issue,
- whether it was an engine problem or a target-site problem,
- and what data, if any, was still returned.
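The three questions above map naturally onto a structured log record. This sketch assumes a JSON-lines log format and a hypothetical `incident_record` helper; the field names and the `origin` labels are illustrative, not the engine's actual log schema.

```python
import json
import time


def incident_record(url, origin, bytes_returned, warnings=(), now=None):
    """Build one JSON log line answering the three post-incident questions."""
    if origin not in ("engine", "target_site", "unknown"):
        raise ValueError(f"unexpected origin: {origin!r}")
    return json.dumps(
        {
            "ts": time.time() if now is None else now,  # when it happened
            "url": url,  # what URL or workload triggered the issue
            "origin": origin,  # engine problem or target-site problem
            "bytes_returned": bytes_returned,  # what data was still returned
            "warnings": list(warnings),
        },
        sort_keys=True,
    )
```

Recording the size of the returned payload instead of the payload itself keeps the log useful for triage without copying scraped content into your logging pipeline.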
Example Hardening Sequence
If you are moving from a dev VM to a real environment, the order should usually be:
- put a reverse proxy and TLS in front,
- add auth and external rate limiting,
- move secrets into runtime injection,
- restrict network access around the API and any renderer sidecar,
- then enable monitoring and alerting on warnings, failures, and resource spikes.
That order keeps the riskiest exposure points under control early instead of treating hardening as a final cleanup step.
When To Isolate the Renderer More Aggressively
Stronger isolation is worth it when:
- your targets are highly dynamic and require frequent JS rendering,
- the service runs close to internal systems with sensitive access,
- or many tenants or workloads share the same cluster.
In those cases, a renderer problem should not become an easy pivot into the rest of your infrastructure.
Common Mistakes
- Leaving /health broadly exposed when only an internal load balancer needs it.
- Running the service with broader filesystem or network access than the scraping workload requires.
- Keeping incident logs too thin to separate target-site anti-bot issues from engine regressions.
Pair this page with the rate limits and error codes pages so that operational hardening and runtime diagnostics are read together.