Is a billion-dollar worth of server lying on the ground?

Note: Some details of the stories in this article are slightly altered to protect the privacy of the companies I worked for

It's somewhat anecdotal, but in my work, I often encounter projects that seem to use highly inefficient infrastructure providers, from a cost perspective.

I usually point out that, based on a fairly unbiased hardware comparison, they could save over half their budget by migrating, and I am usually met with a series of almost canned answers about migrations being too difficult due to x, y, z.

I - A representative comparison

I will pick one of the "expensive" and one of the "cheap" server providers, chosen simply based on the fact that I've worked with them a lot, and compare two of their high~ish end servers.

I'm going to take one example from a cheap server provider, OVH, and a somewhat worse machine from AWS.

OVH offers this machine:

For $15,800/year (though it can be paid monthly)

For a close comparison, AWS offers their r4.16xlarge. I'll try to figure out the exact hardware specs from this description:

R4 instances feature up to 64 vCPUs and are powered by two AWS-customized Intel XEON processors based on E5-2686v4 that feature high-memory bandwidth and larger L3 caches to boost the performance of in-memory applications.

So basically let's call it 2x E5-2686v4. The "real" E5-2686v4 seems to have more cores (both real and virtual) than the AWS version, but I'll give AWS the benefit of the doubt and say that their version is more or less the same. I'll also be generous and assume AWS's RAM is the same 2666MHz DDR4 ECC (basically the best you can get right now), though they don't specify this.

So we have:

For $37,282/year (paid hourly) or $25,771/year (paid upfront)

The OVH server has more memory, it comes with 1TB of very fast storage, and adding more storage is much cheaper than AWS EBS prices (+ you get the option for NVME SSDs connected via PCIe on all servers).

I chose two processors which are fairly similar, but running a comparison is still hard. Unlike e.g. RAM, processors are much more synergistic: you can't just look at parameters like number of cores, cache size, and frequency to figure out how well they perform.

Still, these two seem to be pretty close on those parameters and when looking at the benchmarks. It seems that the Gold 6132 is marginally better than the E5-2686v4. Granted, benchmarking server CPUs is hard, but still, I think it's fair to say that the former has at least a tiny advantage, even if somehow the E5-2686v4 performs worse on benchmarks than on "real tasks".

So we have 2 servers:

If the first server, the one that is better in literally every way, costs ~16k/year... how much should the other one cost? Well, maybe 10, maybe 12, maybe 14?

I don't know, but the answer certainly shouldn't be "Almost twice as much at 26k/year", that's the kind of answer that indicates something is broken.

Even in the scenario most favorable to AWS, it is ~1.6x as expensive, and that's with AWS paid upfront for the year. If we compare paid monthly to paid hourly (not exactly fair) we get 37k vs 16k. If we do some napkin math for equivalent storage cost (with equivalent speed via guaranteed IOPS), we easily get to ~3k/year extra. That makes it 40k vs 16k: the AWS machine, the one with the worse specs, costs ~250% as much.
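The napkin math above can be checked in a few lines. This is only a sketch using the yearly prices quoted in this article; the ~3k/year storage figure is my own rough estimate from above, not a quoted price, and the variable names are made up for illustration.

```python
# Yearly prices quoted in this article (USD).
OVH_YEARLY = 15_800
AWS_PAID_HOURLY = 37_282
AWS_PAID_UPFRONT = 25_771
# Rough estimate from the text for equivalent storage, not a quoted price.
AWS_EXTRA_STORAGE = 3_000


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs than the cheap one."""
    return expensive / cheap


# Most favorable case for AWS: paid upfront, storage ignored.
best_case = cost_ratio(AWS_PAID_UPFRONT, OVH_YEARLY)
# Least favorable case: paid hourly, plus the storage estimate.
worst_case = cost_ratio(AWS_PAID_HOURLY + AWS_EXTRA_STORAGE, OVH_YEARLY)

print(f"best case for AWS:  {best_case:.2f}x")   # prints 1.63x
print(f"worst case for AWS: {worst_case:.2f}x")  # prints 2.55x
```

Swapping in the prices of any other pair of machines gives the same kind of range, which is more useful than a single headline number.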

But whether the worse AWS machine is 160% or 250% as expensive as the OVH one is not the relevant question here. The question is why we are seeing those astronomical differences in cost to begin with.

We should consider that there are hosting providers cheaper than OVH (e.g. Scaleway, potentially online.net, and other such providers you never heard of). On the flip side of the coin, there are server providers such as DigitalOcean, GC, and Azure that can be more expensive than AWS.

Why?

II - Vendor lock-in hypothesis

The easiest thing to do here is to cry vendor lock-in.

The story goes that you end up using Firebase for authentication, then you hire a sysadmin / DevOps guy that knows GC to create your infrastructure there. Then you make use of some fancy Google ML service that integrates seamlessly with the GC storage... so on and so forth, until it would cost you a lot more manpower to move away from GC than to pay them a bit extra for whatever compute or storage you could get for less elsewhere.


This is compounded by the fact that most of the time startups are oblivious to the cost of these services.

I switched my personal "infrastructure" from AWS since it ended up costing me over $100/month to maintain. Nowadays I pay $23/month and get a lot more leeway out of my current setup. But I haven't done that with some startups I've worked with or advised, even though the cost savings could have one or two additional zeros added to them. Why?

I can often call the shots regarding hardware at the startups I've worked with, yet I usually can't argue against using AWS or GC... because often enough, the first hit is free. AWS, GC, and Azure are throwing out 10k$ worth of credits like candy, and topping that off with 50-200k$ worth of credit for startups that they think have potential. The catch here is that the credits expire in 1 year, and once that year is done many are probably locked into the vendor.

The startup model is one of exponential growth: most fail, and the winners have dozens or hundreds of millions from investments. So what is one or two hundred thousand a year on an IaaS bill?

Well, the answer is almost nothing. I believe the standard AWS offering for free credits is something like 100k$/year. So assuming a startup that uses that for a year gets 10mil in investment, it costs them 1% of their budget a year to maintain that.

The problem is that investment reflects future potential worth. A startup receiving a 10 mil investment is probably operating at a small fraction of the capacity those investors hope it will reach. For the shares to be worth 5x that original investment, the company might have to scale its operations 20x, or 50x, or 100x.

This becomes a problem since you can't run on investment forever, and scaling up 20x suddenly turns that 100k into 2 million a year spent on servers.
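To make the scaling argument concrete, here is a small sketch using the same stand-in numbers as above. It assumes, as the text does, that the server bill grows linearly with the size of the operation; that is a simplification, and the function name is made up for illustration.

```python
# Stand-in numbers from the text, not real data.
YEARLY_INFRA_BILL = 100_000
INVESTMENT = 10_000_000


def scaled_bill(yearly_bill: int, scale_factor: int) -> int:
    """Yearly bill after scaling operations, assuming linear cost growth."""
    return yearly_bill * scale_factor


# Share of the investment eaten by infrastructure before scaling.
share_before = YEARLY_INFRA_BILL / INVESTMENT
print(f"share of investment before scaling: {share_before:.0%}")  # prints 1%

# The three scale factors mentioned in the text.
bills = {factor: scaled_bill(YEARLY_INFRA_BILL, factor) for factor in (20, 50, 100)}
for factor, bill in bills.items():
    print(f"{factor}x scale -> ${bill:,}/year")
```

At 100x the yearly bill equals the entire original investment, which is the point where "it's only 1%" stops being an argument.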

Of course, this is just a hypothetical; the numbers here are stand-ins to make a point, not a case study. From my own experience, the lock-in funnel looks something like:

  1. Free credits, let's use {expensive infrastructure provider}.
  2. Loads of investment money, let's not waste time switching away from {expensive infrastructure provider}, it's < 1% of our yearly budget.
  3. Turns out that once the company grew, {expensive infrastructure provider} now makes up a double-digit percentage of our yearly expenditure, but it's too late to switch now.

This situation is exacerbated by consolidation (big fish buys little fish). I vividly remember a situation where I found an optimal hardware+software combination for a data processing platform, I think a conservative estimate would be that it was ~5 times cheaper than the vendor lock-in alternative being used at the time.

This happened to be worth the switch since the startup lacked a generous credit offer for Google Cloud. But, as soon as it was "consolidated", I was forced to switch the whole system back to Google Cloud. Granted, it was a much better GC setup, but one that still involved costs ~2-3x greater than the original solution.

Why? Well, it boils down to the parent company using Google Cloud, all their employees knowing how to work with GC, all their contracts having weird security-related clauses composed by many lawyers based on "official security audits" run on their GC infrastructure, and so on.

However, this leads nicely into my second hypothesis.

III - Employee lock-in hypothesis

Employees end up deciding most of what a company is using internally, including infrastructure providers.

People aggregate along weird lines, to the extent that it wouldn't surprise me if a CTO hired initial engineers that favored his preferred infrastructure provider, even if he didn't actively seek that trait out.

Once the first few employees are fans of a given infrastructure provider, it starts making it into the job specs, because onboarding someone familiar with AWS when you use Azure is a huge pain in the ass. All other things being equal you'd rather have someone familiar with the technology you are already using.

This is compounded by the kind of employees that permeate a given field. If you are developing mobile apps or web apps, for example, it's likely that many engineers you will find will be familiar with Heroku and DigitalOcean. If you are developing whatever the heck people use C# for, I'd bet you'll find people that know how to use Azure. If you are doing machine learning, most people will know a thing or two about Google Cloud's offers regarding TPUs.

More broadly, this leaves no room for people that want to have a "multi-cloud" infrastructure or use a very little-known platform. Either you get engineers that are well versed in the subject, which will cost extra, or you resign yourself to having a few experts on the subject handle everything, with the rest of the team having no idea how to boot up a new VM without calling someone up.

Of course, some of you might say, "a Linux machine is just a Linux machine, once you ssh into it it's all the same", and I agree, and most engineers I speak with also agree. But all those engineers also happen to be very expensive to hire, all things considered. Having recently introduced a "please explain to me how a | is used in a bash shell" question in my interviews, I am surprised by how many people with claimed "DevOps" knowledge can't answer that elementary question given examples and time to think it out (granted, on a ~60 sample size).
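For anyone wondering what that question is getting at: a | connects the standard output of one process to the standard input of the next. Below is a minimal sketch of the same mechanism using only Python's standard library. The two child programs are made up for illustration (one echoes its argument, one upper-cases its input) and are spawned through the current Python interpreter so the example doesn't depend on any particular shell tools.

```python
import subprocess
import sys


def run_pipeline(text: str) -> str:
    """Roughly what a shell does for `producer | consumer`."""
    # Producer: writes its argument to stdout.
    producer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", text],
        stdout=subprocess.PIPE,
    )
    # Consumer: reads stdin, writes the upper-cased text to stdout.
    consumer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
        stdin=producer.stdout,  # this connection is the pipe
        stdout=subprocess.PIPE,
    )
    # Close our copy of the pipe so only the consumer holds the read end.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()


print(run_pipeline("hello"))
```

Both processes run concurrently, and the data never touches a file in between, which is the part people tend to miss.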

But even if you assume all capable software developers don't need provider-specific hand holding to boot up a machine and run commands on it via ssh, things change once you get to orchestration. I can probably make a Kubernetes based infrastructure on GC and AWS, but I might fail at it miserably on Azure. I can create a "serverless" infrastructure on AWS, but not on any other infrastructure provider, at least not without a time commitment twice as long.

Speaking of which...

IV - Abstracting costs via services

I can make an apples to apples comparison between OVH and AWS servers, or GC and Scaleway servers, but I can't make one between GC Vision AI and Sagemaker, or between GC's ai-platform notebooks and IBM's Watson, or between the costs of a lambda-based "serverless" solution and using the OVH API to provision VMs on demand.

The world is moving more towards a "hardware+software functionality bundled as a service" model, where you are no longer buying servers with a UNIX OS; you are instead buying some abstract token of storage, memory, and compute time. This not only makes comparisons harder but often forces you into specific choices regarding your code, choices which make the vendor lock-in much stronger.

This double-whammy of being locked in via the way your code runs, and being unable to make a fair comparison since your code couldn't run on literally any other platform, reminds me of the "mainframe" era, when companies would get locked into their physical hardware seller (e.g. IBM or Oracle): the hardware would only be able to run vendor-specific code, making the company write more vendor-specific code, leading to a vicious cycle.

That is not to say comparison becomes impossible in these situations, but it becomes much harder, it requires a rewrite of the codebase, and it comes with compromises. Maybe AWS compute is cheaper but their NLP services are a bit worse, and then you have to quantify the cost of using a cross-platform solution or losing a slight edge in terms of an NLP component.

V - Lock in via misinformation

I also believe lock-in happens because of disinformation campaigns. A good example is "serverless infrastructures". The claim AWS makes (citing them since they were the first to introduce the concept) is:

Modern applications are built serverless-first, a strategy that prioritizes the adoption of serverless services, so you can increase agility throughout your application stack. We’ve developed services for all three layers of your stack: compute, integration, and data stores

In other words:

Modern applications are not built to run on an operating system, instead they are built to run on 3 AWS-specific abstraction layers, compute, integration, and data store. This means you'd have to redesign your entire codebase to use any other hardware provider.

Serverless is a misnomer used to induce the idea that you are no longer dependent on hardware, when instead you are abandoning the ability to run on 99.9% of the world's hardware, in favor of only being able to run on AWS.

Or to take another example, let's look at AWS Aurora:

combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.

In other words:

combines the ability of traditional enterprise databases to keep increasing prices due to vendor lock-in, but is based on code from open source projects in order to provide a similar interface that gives you the allure of thinking you can migrate your data and queries from it as easily as you could from Postgres or MySQL.

At the same time, GC, AWS, and Azure try to steer language away from any comparable metrics that one might use to quantify their machines.

What's your memory? A number of GBs. Wanna know the interface (i.e. DDR4, DDR3... etc)? The latency? The frequency? If it's ECC? Tough luck.

What's your CPU capacity? A number of "virtual cores". Wanna know the L1/L2/L3 size? If it's shared? The behavior of the hypervisor being used? If you can change the scheduler? The frequency? The instruction set? Tough luck.

What's your storage? An SSD with a number of GBs. What SATA generation does it use? Is it NVMe? Are the IOPS numbers a function of cached reads or cold reads? Tough luck, better run the benchmarks.

And on providers like AWS, in my own experience, the numbers are never the same. I noticed over 2x differences in read and write speeds for EBS volumes with the same specs, attached to machines with the same specs, using basic hdparm and dd based benchmarks.
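If you want a rough first look at a volume without reaching for hdparm or dd, something like the sketch below will do. It is not equivalent to either tool: the function name and defaults are made up for illustration, it only measures sequential throughput, and the read pass will usually be served from the OS page cache, so treat the read number as an upper bound rather than a cold-read measurement.

```python
import os
import tempfile
import time


def sequential_throughput_mb_s(size_mb=64, block_kb=1024, directory=None):
    """Return (write_MB_s, read_MB_s) for a temporary file in `directory`."""
    block = os.urandom(block_kb * 1024)  # random data, so it can't be compressed away
    n_blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data to the device before stopping the clock
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass  # likely a cached read, see the caveat above
        read_seconds = time.perf_counter() - start
    finally:
        os.remove(path)

    total_mb = n_blocks * block_kb / 1024
    return total_mb / write_seconds, total_mb / read_seconds


write_mb_s, read_mb_s = sequential_throughput_mb_s(size_mb=16)
print(f"write: {write_mb_s:.0f} MB/s, read: {read_mb_s:.0f} MB/s")
```

Point `directory` at a mount on the volume you care about and run it a few times on "identical" machines; the spread between runs is the interesting part.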

The variable performance is not just something I observed either, Netflix has a whole instance selection process for AWS in order to grab better EC2 VMs.

I'm not saying hardware specs are unavailable, if you dig deep enough and ask support you can get everything you need. You can find out most things by just looking at the actual machine once you purchase it. But they are hidden, they aren't plastered on the walls as a selling point, they are a dirty little secret that most popular infrastructure providers would rather that you ignore.

Even better, most of them would rather that you don't even use that language anymore. Hence why "core" and "CPU" have been replaced by "vCPU" or even something more abstract like "compute unit".

VI - Popularity and cost increase

The final reason I would cite here is the fact that popularity can obviously lead to costs increasing. For all the above reasons, popularity ends up being a big factor.

The more popular you are, the more companies are locked into your ecosystem, the more they will draw their subsidiaries and contractors into it.

The more popular you are, the more developers know how to use your system, the stronger the employee lock-in effect becomes.

The more those two things are true, the more you can abstract away basic infrastructure as "services" to deny easy comparisons of price and make migrating code away harder.

The more those three things are true, the more you can "educate" people away from the very language that would allow them to compare your service against others on unfavorable grounds. To pull the very ideas needed for comparison out of the vocabulary people use when comparing infrastructure providers. To steal concepts that could be used against your services and rephrase them to make them seem like one of the advantages your services bring.

VII - No Monopoly

Given the economy of scale that applies on top of all that, I'd expect more of a monopoly in the infrastructure provisioning space. Which is to say that I am pleasantly surprised by the current state of it all.

Finding relevant numbers on this is difficult, but based on various metrics it seems that the war has hardly been won by anyone.

Even in the most monopoly-prone field of "infrastructure as a service", AWS dominates a bit under 50% of the market, with no-name companies ( < 1.8% market share) taking 23.2% of the pie.

I can think of plenty of small companies making a big impact in markets I'd have imagined forever lost to monopoly. An interesting example here is the recent domination of the CDN space by Fastly.

Even more so, while I usually harp on about solutions like Kubernetes being a step in the wrong direction, I must admit that I've started seeing it used more and more as a solution for infrastructure-agnostic deployments.

Add to that orchestration tools like Terraform and software such as serverless, designed to abstract away over "serverless" services provided by various IaaS providers.

On the one hand, billions are probably getting wasted on overpriced infrastructure. On the other hand, it seems like a software environment meant to make cost-cutting on infrastructure easier is alive and kicking, growing bigger every day.

In part, I have to think this can be attributed to the constant desire of programmers to experiment with new stuff, to write new abstractions, to try to out-do each other.

VIII - Benchmarking instead of anecdotes

We're already past the point where it's easy to say that one IaaS provider is more efficient than another.

The cleanest example is comparing servers with similar OSes and CPUs with similar instruction sets, like my original example. But what about storage? What about CDNs? What about GPUs? What about FPGAs? What about services that bundle a bunch of those together seamlessly?

It becomes really complicated. In my own experience, I was often flabbergasted by the difference between benchmarks with my own workload when compared to the claims of some hardware providers. A vivid example that comes to mind was comparing AWS's own Aurora database versus an RDS Postgres cluster of the same price... and finding out Aurora was ~30% slower. In a comparison where I was sure that, since AWS was holding all the cards, their custom database solution would win.

The problem with benchmarks, however, is that they are very workload-specific. I can go to almost any company and claim with a straight face that they are probably using an overpriced infrastructure provider, and that moving their compute and/or storage heavy workloads to another would save costs. However, I can't always make that same claim if my hourly rate for designing the benchmarks is included in that cost-benefit analysis.

That is, in large part, why I am somewhat against "infrastructure as a service": past some level of abstraction, benchmarking becomes unfeasible because it involves redesigning the whole codebase. Once that happens we reach JavaScript land, where objective measures are replaced by marketing and hype.
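That cost-benefit question is just a break-even calculation. Here is a minimal sketch of it; every number in the example call is a made-up placeholder, not a figure from any company I worked with, and the function itself is only one simple way to frame the trade-off.

```python
def break_even_months(benchmark_hours, hourly_rate, migration_cost, monthly_savings):
    """Months of savings needed to pay back the benchmarking and migration work."""
    if monthly_savings <= 0:
        return float("inf")  # no savings means the work never pays for itself
    upfront_cost = benchmark_hours * hourly_rate + migration_cost
    return upfront_cost / monthly_savings


# Hypothetical: 80 hours of benchmark design at $150/hour, a $20k migration,
# and $2k/month saved afterwards.
months = break_even_months(
    benchmark_hours=80,
    hourly_rate=150,
    migration_cost=20_000,
    monthly_savings=2_000,
)
print(f"break-even after {months:.0f} months")  # prints 16 months
```

For a company with a small bill, the break-even point can easily sit beyond the horizon anyone is planning for, which is exactly why the benchmarks don't get written.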

IX - In conclusion

I think that I can paint a pretty bleak picture in terms of money being wasted on overpriced infrastructure.

One where new companies get lured away into the nets of inefficient hardware providers by unreasonably high offers of free credits. This then leads to them recruiting people familiar with those hardware providers. Potentially getting further locked-in with more free credits and seminars on various provider-specific services that hook into their codebase.

The cost of the infrastructure providers finally becomes apparent once the company scales, but at that point moving away is impossible. Not only is everyone in the company an expert in the hardware provider they chose, not only is the hardware provider embedded into every facet of their codebase, but the very language that one could use to make a comparison is taken away, and the possibility of benchmarking on a different infrastructure is an endeavor akin to rewriting everything from scratch.

On the other hand, I think that reality clearly disagrees with this bleak picture, when aggregated, since new infrastructure companies are alive and kicking, potentially now more than ever.

Not only competitors to the big 4 (5? 6?), but also companies trying to open up whole new niches. To top that off, no matter how convoluted IaaS gets, some madmen are always willing to build a provider agnostic abstraction for it. Couple that with migration tools and the rosy tints are returning to my glasses.

I do have a hunch that, for many businesses, a billion-dollar worth of server might be lying on the ground. But I'm not sure what it would take for them to pick it up.

Having a culture of benchmarking is the obvious solution, but that's been the obvious solution to almost all engineering problems for the last 100 years, and somehow it never sticks.

Having a team with diverse know-how is another, but this can backfire into a team that's so passionate about hardware they build a convoluted mess. Even if that doesn't happen, this hypothetical team might be so expensive as to nullify the savings.

Another obvious answer is that smaller, nimbler companies can pick up the slack and get a piece of the market share. Which indicates an interesting opportunity, of creating companies focused around not buying into first-hit-is-free IaaS and instead finding optimal solutions for infrastructure to give them an edge. I've even worked at some of these companies, but the end result seems to be that said edge is not enough, and recruiting people that can maintain the infrastructure becomes a significant problem.

If I thought the whole "beat out the big guys by using cheap compute" thing was so easy, I'd be doing it instead of writing about it.

So I must end this article on an uncertain note, unsure that I was able to convey a solution to anyone's problem, but maybe with the hope that I've outlined an interesting perspective regarding the cost-benefit gap in infrastructure services.

Published on: 2020-11-02









