When to assume neural networks can solve a problem

A pragmatic guide

The question: “What are the problems we should assume can be solved with machine learning?”, or even narrower and more focused on current developments “What are the problems we should assume a neural network is able to solve?”, is one I haven’t seen addressed much.

There are theories like PAC learning and AIX which at a glance seem to revolve around this, as it pertains to machine learning in general, but if actually applied in practice won’t yield any meaningful answers.

However, when someone asks me this question about a specific problem, I can often give a fairly reasonable confidence answer provided I can take a look at the data.

Thus, I thought it might be helpful to lay down the heuristic that generates such answers. I by no means claim these are precise or evidence-based in the scientific sense, but I think they might be helpful, maybe even a good start point for further discussion on the subject.

1. A neural network can almost certainly solve a problem if another ML algorithm has already succeeded.

Given a problem that can be solved by an existing ML technique, we can assume that a somewhat generic neural network, if allowed to be significantly larger, can also solve it.

For example, playing chess decently is also a problem already solved. It can be done using small decision trees and a few very simple custom search heuristics (see for example https://github.com/AdnanZahid/Chess-AI-TDD). So we should assume that a fairly generic neural network, using significantly more parameters than the original DT based models, can be trained to play chess equally well.

Indeed, this seems to be the case, you can play chess using a fairly generic fully connected network, without requiring any additional built-in heuristics or an architecture specialized for this purpose.

Or, take any “toy” datasets such as those on UCI, find the best fitting classical ML model using a library like Sklearn, then try using the fairly simple neural network that Sklearn provides, if you set the _hidden_layer_sizes to be large enough, you will almost certainly get matching performance no matter what model you are comparing against.

This assumption doesn’t always hold because:

a) Depending on the architecture, a neural network could easily be unable to optimize a given problem. E.g. Playing chess might be impossible for a convolutional network with a large window and step size, even if it’s very big.
b) Certain ML techniques have a lot of built-in heuristics that might be hard to learn for a neural network. The existing ML technique mustn’t have any critical heuristics built into it, or at least you have to be able to include the same heuristics into your neural network model.

As we are focusing mainly on generalizable neural network architectures (e.g. a fully connected net, which is what most people think of initially when they hear “neural network”), point a) is pretty irrelevant.

Given that most heuristics are applied equally well to any model, even for something like chess, and that size can sometimes be enough for the network to be able to just learn the heuristic, this rule basically holds almost every time.

I can’t really think of a counter-example here… Maybe some specific types of numeric projections?

This is a rather boring first rule, yet worth stating as a starting point to build up from.

2. A neural network can almost certainly solve a problem very similar to ones already solved

Let’s say you have a model for predicting the risk of a given creditor based on a few parameters, e.g. current balance, previous credit record, age, driver license status, criminal record, yearly income, length of employment, {various information about current economic climate}, number of children, marital status, porn websites visited in the last 60 days.

Let’s say this model “solves” your problem, i.e. it predicts risk better than 80% of your human analysts.

But GDPR rolls along and you can no longer legally spy on some of your customers’ internet history by buying that data. You need to build a new model for those customers.

Your inputs are now truncated after and the customer’s online porn history is no longer available (or rather admittedly usable).

Is it safe to assume you can still build a reasonable model to solve this problem?

The answer is almost certainly “yes; given our knowledge of the world, we can safely assume someone’s porn browsing history is not that relevant to their credit rating as some of those other parameters.

Another example: assume you know someone else is using a model, but their data is slightly different from yours.

You know a US-based snake-focused pet shop that uses previous purchases to recommend products and they’ve told you it’s done quite well for their bottom line. You are a UK-based parrot-focused pet shop. Can you trust their model or a similar one to solve your problem, if trained on your data?

Again, the right answer is probably “yes”, because the data is similar enough. That’s why building a product recommendation algorithm was a hot topic 20 years ago, but nowadays everyone and their mom can just get a WordPress plugin for it and get close to Amazon’s level.

Or, to get more serious, let’s say you have a given algorithm for detecting breast cancer that — if trained on 100,000 images with follow-up checks to confirm the true diagnostics — performs better than an average radiologist.

Can you assume that, given the ability to make it larger, you can build a model to detect cancer in other types of soft tissue, also better than a radiologist?

Once again, the answer is yes. The argument here is longer, because we aren’t so certain, mainly because of the lack of data. I’ve spent more or less a whole article arguing that the answer would still be yes.

In NLP the exact same neural network architectures seem to be decently good at doing translation or text generation in any language, as long as it belongs to the Indo European family and there is a significant corpus of data for it (i.e. equivalent to that used for training the extant models for English).

Modern NLP techniques seem to be able to tackle all language families, and they are doing so with less and less data. To some extent, however, the similarity of the data and the number of training examples are tightly linked to the ability of a model in quickly generalizing for many languages.

Or looking at image recognition and object detection/boxing models, the main bottleneck is large amounts of well-labeled data, not the contents of the image. Edge cases exist, but generally all types of objects and images can be recognized and classified if enough examples are fed into an architecture originally designed for a different image task (e.g. a convolutional residual network designed for Imagenet).

Moreover, given a network trained on Imagenet, we can keep the initial weights and biases (essentially what the network “has learned”) instead of starting from scratch, and it will be able to “learn” on different datasets much faster from that starting point.

3. A neural network can solve problems that a human can solve with small-sized datapoints and little to no context

Let’s say we have 20x20px black and white images of two objects never seen before; they are “obviously different”, but not known to us. It’s reasonable to assume that, given a bunch of training examples, humans would be reasonably good at distinguishing the two.

It is also reasonable to assume, given a bunch of examples (let’s say 100), that almost any neural network of millions of parameters would ace this problem like a human.

You can visualize this in terms of amounts of information to learn. In this case, we have 400 pixels of 255 values each, so it’s reasonable to assume every possible pattern could be accounted for with a few million parameters in our equation.

But what “small datapoints” means here is the crux of this definition.

In short, “small” is a function of:

The size of your model. The bigger a model, the more complex the patterns it can learn, the bigger your possible inputs/outputs.
The granularity of the answer (output). E.g 1,000 classes vs 10 classes, or an integer range from 0 to 1,000 vs one from 0 to 100,000. In this case 2.
The size of the input. In this case 400, since we have a 20x20 image.

Take a classic image classification task like MNIST. Although a few minor improvements have been made, the state-of-the-art for MNIST hasn’t progressed much. The last 8 years have yielded an improvement from ~98.5% to ~99.4%, both of which are well within the usual “human error range”.

Compare that to something much bigger in terms of input and output size, like ImageNet, where the last 8 years have seen a jump from 50% to almost 90%.

Indeed, even with pre-CNN techniques, MNIST is basically solvable.

But even having defined “small” as a function of the above, we don’t have the formula for the actual function. I think that is much harder, but we can come up with a “cheap” answer that works for most cases — indeed, it’s all we need:

A given task can be considered small when other tasks of equal or larger input and output size have already been solved via machine learning with more than one architecture on a single GPU

This might sound like a silly heuristic, but it holds surprisingly well for most “easy” machine learning problems. For instance, the reason many NLP tasks are now more advanced than most “video” tasks is size, despite the tremendous progress on images in terms of network architecture (which are much closer to the realm of video). The input & output size for meaningful tasks on videos is much larger; on the other hand, even though NLP is in a completely different domain, it’s much closer size-wise to image processing.

Then, what does “little to no context” mean?

This is a harder one, but we can rely on examples with “large” and “small” amounts of context.

Predicting the stock market likely requires a large amount of context. One has to be able to dig deeper into the companies to invest in; check on market fundamentals, recent earning calls, the C-suite’s history; understand the company’s product; maybe get some information from its employees and customers, if possible, get insider info about upcoming sales and mergers, etc.

You can try to predict the stock market based purely on indicators about the stock market, but this is not the way most humans are solving the problem.

On the other hand, predicting the yield of a given printing machine based on temperature and humidity in the environment could be solved via context, at least to some extent. An engineer working on the machine might know that certain components behave differently in certain conditions. In practice, however, an engineer would basically let the printer run, change the conditions, look at the yield, then come up with an equation. So given that data, a machine learning algorithm can also probably come up with an equally good solution, or even a better one.

In that sense, an ML algorithm would likely produce results similar to a mathematician in solving the equation, since the context would be basically non-existent for the human.

There are certainly some limits. Unless we test our machine at 4,000 C the algorithm has no way of knowing that the yield will be 0 because the machine will melt; an engineer might suspect that.

So, I can formulate this 3rd principle as:

A generic neural network can probably solve a problem if:

A human can solve it
Tasks with similarly sized outputs and inputs have already been solved by an equally sized network
Most of the relevant contextual data a human would have are included in the input data of our algorithm.

Feel free to change my mind (with examples).

However, this still requires evaluating against human performance. But a lot of applications of machine learning are interesting precisely because they can solve problems humans can’t. Thus, I think we can go even deeper.

4. A neural network COULD solve a problem when we are reasonably sure it’s deterministic, we provide any relevant context as part of the input data, and the data is reasonably small

Here I’ll come back to one of my favorite examples — protein folding. One of the few problems in science where data is readily available, where interpretation and meaning are not confounded by large amounts of theoretical baggage, and where the size of a datapoint is small enough based on our previous definition. You can boil down the problem to:

Around 2,000 input features (amino acids in the tertiary structure), though this means our domain will only cover 99.x% of proteins rather than literally all of them.
Circa 18,000 corresponding output features (number of atom positions in the tertiary structure, aka the shape, needing to be predicted to have the structure).

This is one example. Like most NLP problems, where “size” becomes very subjective, we could easily argue one-hot-encoding is required for this type of inputs; then the size suddenly becomes 40,000 (there are 20 proteinogenic amino acids that can be encoded by DNA) or 42,000 (if you care about selenoproteins and 44,000 if you care about niche proteins that don’t appear in eukaryotes).

It could also be argued that the input & output size is much smaller since in most cases proteins are much smaller, thus we can mask & discard most of the inputs & outputs for most cases. Still, there are plenty of tasks that go from an, e.g. 255x255 pixel image to generate another 255x255 pixel image (style alternation, resolution enhancement, style transfer, contour mapping… etc). So based on this I’d posit the protein folding data is reasonably small and has been for the last few years.

Indeed, resolution enhancement via neural networks and protein folding via neural networks came about at around the same time (with every similar architecture, mind you). But I digress; I’m mistaking a correlation for the causal process that supposedly generated it. Then again, that’s the basis of most self-styled “science” nowadays, so what is one sin against the scientific method added to the pile?

Based on my own fooling around with the problem, it seems that even a very simple model, simpler than something like VGG, can learn something ”meaningful” about protein folding. It can make guesses better than random and often enough come within 1% of the actual position of the atoms, if given enough (135 million) parameters and half a day of training on an RTX2080. I can’t be sure about the exact accuracy, since apparently the exact evaluation criterion here is pretty hard to find and/or understand and/or implement for people that aren’t domain experts… or I am just daft, also a strong possibility.

To my knowledge the first widely successful protein folding network AlphaFold, whilst using some domain-specific heuristics, did most of the heavy lifting using a residual CNN, an architecture designed for categorizing images, something as widely unrelated with protein folding as one can think of.

That is not to say any architecture could have tackled this problem as well. It rather means we needn’t build a whole new technique to approach this type of problem. It’s the kind of problem a neural network can solve, even though it might require a bit of looking around for the exact network that can do it.

The other important thing here is that the problem seems to be deterministic. Namely:

a) We know peptides can be folded into proteins, in the kind of inert environment that most of our models assume, since that’s what we’ve always observed them to do.
b) We know that amino acids are one component which can fully describe a peptide
c) Since we assume the environment is always the same and we assume the folding process itself doesn’t much alter it, the problem is not a function of the environment (note, obviously in the case of in vitro folding, in vivo the problem becomes much harder)

The issue arises when thinking about b), that is to say, we know that the universe can deterministically fold peptides; we know amino acids are enough to accurately describe a peptide. However, the universe doesn’t work with “amino acids”, it works with trillions of interactions between much smaller particles.

So while the problem is deterministic and self-contained, there’s no guarantee that learning to fold proteins doesn’t entail learning a complete model of particle physics that is able to break down each amino acid into smaller functional components. A few million parameters wouldn’t be enough for that task.

This is what makes this 4th most generic definition the hardest to apply.

Some other examples here are things like predictive maintenance where machine learning models are being actively used to tackle problems humans can’t, at any rate not without mathematical models. For these types of problems, there are strong reasons to assume, based on the existing data, that the problems are partially (mostly?) deterministic.

There are simpler examples here, but I can’t think of any that, at the time of their inception, didn’t already fall into the previous 3 categories. At least, none that aren’t part of reinforcement learning.

The vast majority of examples fall within reinforcement learning, where one can solve an impressive amount of problems once they are able to simulate them.

People can find optimal aerodynamic shapes, design weird antennas to provide more efficient reception/coverage, beat video games like DOT and Starcraft which are exponentially more complex (in terms of degrees of freedom) than chess or Go.

The problem with RL is that designing the actual simulation is often much more complicated than using it to find a meaningful answer. RL is fun to do but doesn’t often yield useful results. However, edge cases do exist where designing the simulation does seem to be easier than extracting inferences out of it. Besides that, the more simulations advance based on our understanding of efficiently simulating physics (in itself helped by ML), the more such problems will become ripe for the picking.

In conclusion

I’ve attempted to provide a few simple heuristics for answering the question “When should we expect that a neural network can solve a problem ?”. For which problems should our default hypothesis include their solvability, given enough architecture searching and current GPU capabilities?

To recap, when can neural networks solve your problem?

[Almost certainty] If other ML models already solved the problem.
[Very high probability] If a similar problem has already been solved by an ML algorithm, and the differences between that and your problem don’t seem significant.
[High probability] If the inputs & outputs are small enough to be comparable in size to those of other working ML models AND if we know a human can solve the problem with little context besides the inputs and outputs.
[Reasonable probability] If the inputs & outputs are small enough to be comparable in size to those of other working ML models AND we have a high certainty about the deterministic nature of the problem (that is to say, about the inputs being sufficient to infer the outputs).

I am not sure about any of these rules, but this comes back to the problem of being able to say something meaningful. PACL can give us almost perfect certainty and is mathematically valid but it breaks down beyond simple classification problems.

Coming up with these kinds of rules doesn’t provide an exact degree of certainty and they are derived from empirical observations. However, I think they can actually be applied to real-world problems.

Indeed, these are to some extent the rules I do apply to real-world problems, when a customer or friend asks me if a given problem is “doable”. These seem to be pretty close to the rules I’ve noticed other people using when thinking about what problems can be tackled.

I’m hoping that this could serve as an actual practical guide for newcomers to the field, or for people that don’t want to get too involved in ML itself but have some datasets they want to work on.

If you enjoyed this article you might also like:

Published on: 1970-01-01

Reactions: