No easy solution unfortunately 😞 (at least not one we’ve found yet). We’ve explored a few cloud providers: AWS, Oracle Cloud, and Lambda Labs (https://lambdalabs.com/).
Lambda Labs has been the best in terms of price and user experience for experimentation, spiky training workloads, etc.
AWS has been our go-to choice for production workloads (primarily because the rest of our infra is on AWS), but it's pricier and A100s are not as available. Depending on the exact use case, their Trainium chips (trn instances) could be a good choice too.
Happy to chat more!