Trouble with Learning . . . consequences for AI Business

PG Madhavan
14 min read · Jul 4, 2019


In this article, we will see why ML or AI (which is applied ML today) cannot achieve perfection. This means AI solution developers and their clients should set their expectations lower; it turns out that having a "person in the loop" is how we can mitigate the AI risks. As Prof. Tom Malone of MIT Sloan asks in his AI for Business course, "How can people and computers be connected so that — collectively — they act more intelligently than any person, group or computer has ever done before?" — this leads to what he calls "More AI". What we will find in this article is that "Person+AI" is the way forward . . .

After a second-order analysis of the process of machine learning, we discuss the business consequences and propose some practical mitigations.

We really don’t know how Learning works …

Learning is the fundamental building block of ML & AI. But do we know what learning really is, what the different types are and how they are interrelated?

When the current AI revolution began, a "learning machine" was defined thus: "A computer program that modifies its operations iteratively at runtime towards an objective can be considered a learning machine". This "iterative modification" is otherwise known as learning . . . step-by-step learning, as opposed to step-by-step instructions that directly achieve that objective.

AI has taken its cue from animal and human learning. Classical and operant conditioning are common to animals and people. AI shoots for a higher form of learning in humans — "cognitive" learning. This is the start of the problem — this higher form of human learning is hard to figure out; I am going to claim that it is IMPOSSIBLE!

In the first half of the 20th century, neurophysiologists such as Sherrington, Luria and Penfield made great progress in delineating areas of the human brain that correspond to higher-order human activities. It is important to note that these studies were conducted under extremely unnatural conditions — human subjects with bits of brain blown away in the wars or with severe brain disorders such as grand mal seizures. Such subjects exhibit far-from-NORMAL human brain activity! More recent imaging studies are also a bit of a red herring: the tasks have to be repeated over many tens of minutes (whereas human brain processing has typical time constants of tens of milliseconds), which can be misleading due to the "blurring effect", and the scans track only certain molecules in the brain.

Taken together, it is fair to say that we will NEVER know how human beings learn since in-vivo studies in NORMAL humans are impossible to conduct. I am discounting the possibility of pure serendipity or shrewd guesses leading us to the prize . . .

So, for AI and ML, we have to live with learning approaches INSPIRED by human learning. There are three main types: (1) Unsupervised, (2) Supervised and (3) Reinforcement learning. I will not discuss Unsupervised below since it is more of a method to extract structure and patterns from data. The other two are actual "step-by-step modification of operations iteratively at runtime" — our definition of a Learning Machine.

All Supervised Learning is rooted in the methods of Steepest Descent and Least Squares. (There is solid theory behind Steepest Descent using the Jacobian and Hessian matrices of the error energy with respect to the weights, but it does not fit with our step-by-step method.)

Here is the scenario for Steepest Descent: you are walking down an (error) hill and you can see ONLY one step ahead. Where you are standing, you "fit a tangent", find the magnitude of the slope of the tangent, and take a step in the negative (downward) direction proportional to the slope magnitude. Eventually, you may get to the very bottom — the "Global Minimum".
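As a minimal sketch of this one-step-ahead walk (my own illustration, not the author's code; the quadratic error function, starting point and step size are assumptions made for the example):

```python
def error_energy(w):
    # Illustrative quadratic "error hill"; the walker cannot see its overall shape.
    return (w - 0.3) ** 2

def local_slope(w, h=1e-5):
    # "Fit a tangent": finite-difference estimate of the slope where we stand.
    return (error_energy(w + h) - error_energy(w - h)) / (2 * h)

w = 0.9          # random starting point on the hill
step_size = 0.1  # step is proportional to the slope magnitude

for _ in range(100):
    w = w - step_size * local_slope(w)   # step in the negative (downward) direction

print(f"final weight: {w:.4f}")          # converges towards the global minimum at w = 0.3
```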

Without using the jargon of Reinforcement Learning (such as Policy, Value Function, etc.), we can approximately understand it within the same scenario. Instead of using one-step-ahead "blinkers", you are allowed to look around, maybe even take multiple steps in various directions, and see if, over many steps, you reach a lower point on the hill; you may have to go UP a small mound to find the eventual lower point in your local exploration. Reinforcement Learning is Steepest Descent but with more exploration. The approach is to "find a balance between exploration (of uncharted territory) and exploitation (of current knowledge)", which is the Wikipedia definition of Reinforcement Learning.
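To make the exploration/exploitation trade-off tangible, here is a toy random-search walk on a bumpy error hill (a generic sketch of mine, not any particular RL algorithm; the mound, the step sizes and the exploration rate are all illustrative assumptions):

```python
import random

def error_energy(w):
    # Illustrative error hill with a small mound between the start (w = 0.9) and the minimum (w = 0.3).
    mound = 0.05 if 0.5 < w < 0.6 else 0.0
    return (w - 0.3) ** 2 + mound

random.seed(0)
w = 0.9
best_w = w
epsilon = 0.3   # fraction of steps spent exploring instead of exploiting

for _ in range(500):
    if random.random() < epsilon:
        candidate = w + random.uniform(-0.2, 0.2)      # explore: a bigger jump, possibly uphill
    else:
        candidate = w + random.choice([-0.02, 0.02])   # exploit: a small local step...
        if error_energy(candidate) > error_energy(w):
            candidate = w                              # ...and refuse it if it goes uphill
    w = candidate
    if error_energy(w) < error_energy(best_w):
        best_w = w

print(f"best weight found: {best_w:.3f}")              # ends up near 0.3 despite the mound
```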

Switching gears, let us explore in depth what was meant when I said in my recent blog, AI Business Transformation Playbook for Executives, that “Each trained ML model will produce slightly different results due to the stochastic nature of the learning procedure. This uncertainty can mask deeper non-robustness.”

We have to get a bit technical with Steepest Descent and related theory to see why ML Learning methods can lead to unintended consequences. Let me be clear that I am NOT condemning current learning methods — I am trying to level-set (to a lower point) our unspoken expectations of Accuracy and Robustness of our AI and ML solutions!

A Careful Look at Steepest Descent:

The objective of any learning method is to minimize “Error Energy”.

Error at any Training Step, ’n’ = e[n]

Squared error, E[n] = e[n] * e[n].

In a stochastic formulation where the inputs are random, Error Energy = Expected value of [Squared Error].

In the given data (deterministic case) where we have the Training Set of N input-output pairs in hand, Error Energy is the average of the squared errors over the set: Eav = (1/N) * (E[1] + E[2] + . . . + E[N]).

The "bottom of the hill" scenario corresponds to reaching the minimum of Eav during the Training stage.

Using the familiar Method of Steepest Descent, first we find the gradient of Eav w.r.t. the weights using the Training data input-output pairs, one at a time.

NOTE: In what is called the "batch" approach, we can find "ideal" slopes if we process ALL of the Training Set in one fell swoop; but that is NOT step-by-step learning!

For step-by-step updates though, we are forced to use the instantaneous estimate of the slope at the current weight and the current Training step, which is the slope of E[n], because the average is not available until all training steps are complete! (Even though we have the whole Training Set in hand, we "make believe" that we do not have all the feature vectors at one time and that each feature vector is given to us one at a time — this approach is required to develop a step-by-step learning procedure; otherwise, you end up with the "block" solution.) So we are forced to work with the slope of E[n] INSTEAD of the slope of Eav.
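Here is a minimal sketch of the contrast (my own illustration with a made-up, purely linear Training Set): a step-by-step update that only ever sees the slope of the instantaneous E[n], next to the "block" alternative that uses all the pairs at once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)   # made-up Training Set inputs (1-dimensional "feature vectors")
d = 0.5 * x                        # desired outputs, truly linear with weight 0.5

# Step-by-step learning: at step n we pretend to see only (x[n], d[n]) and the slope of E[n] = e[n]^2
w, eta = 0.0, 0.3
for n in range(len(x)):
    e = d[n] - w * x[n]            # instantaneous error e[n]
    w = w + eta * e * x[n]         # slope of E[n] w.r.t. w is -2*e[n]*x[n]; step downhill

# "Block" alternative: minimize Eav over ALL pairs in one fell swoop (ordinary least squares)
w_block = np.dot(x, d) / np.dot(x, x)

print(f"step-by-step weight: {w:.4f}   block weight: {w_block:.4f}")   # both close to 0.5
```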

Now we can start diagnosing the issues created by the simplifying assumptions we made in the preceding steps.

We started with wanting to find the minimum of Mean Squared Error, Eav. But, since we wanted to do the minimization step-by-step, we had to use E[n]. Clearly, they are not the same — one is the AVERAGE of ALL the others; as we well know, the average being equal to each and every measurement is highly unlikely other than in the trivial case of all measurements being the SAME!

Conceptually, this is what is going on. Eav has a quadratic relationship with the weights, w, and hence the picture of Eav versus w will be a paraboloid. If there were only 2 weights, we can easily visualize the Eav shape as a “salad bowl” w.r.t. the 2 weights, w1 and w2.

Picture an idealized plot of Eav versus w for a 2-weight case where the relationship between the input and output is truly linear (and assumed as such in LINEAR least squares). Even in this nicely smooth "salad bowl" case, we have a problem due to our use of E[n] instead of Eav, which we will explore further. But let us note now that if the input-output relationship were instead truly NON-linear, the salad bowl would be lined with crumpled-up aluminum foil, full of sharp bumps and crooked valleys! When we invoke the Cover Theorem (see Page 60 of Systems Analytics) for the case of high-dimensional kernel regression or Deep Neural Networks, the aluminum foil does get smoothed out a bit, but not as smooth as the linear case . . .
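One quick way to convince yourself of the quadratic "salad bowl" shape is to compute Eav on a grid of (w1, w2) for a made-up, truly linear 2-weight Training Set (a sketch of mine, not the author's figure; the data and true weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))              # made-up Training inputs with 2 features
d = X @ np.array([0.7, -0.4])              # desired outputs, truly linear in the 2 weights

w1_grid = np.linspace(-1, 1, 101)
w2_grid = np.linspace(-1, 1, 101)
Eav = np.zeros((101, 101))
for i, w1 in enumerate(w1_grid):
    for j, w2 in enumerate(w2_grid):
        e = d - X @ np.array([w1, w2])
        Eav[i, j] = np.mean(e ** 2)        # average squared error at this (w1, w2)

# The quadratic bowl bottoms out at the true weights (0.7, -0.4)
i_min, j_min = np.unravel_index(np.argmin(Eav), Eav.shape)
print(f"grid minimum near w1 = {w1_grid[i_min]:.2f}, w2 = {w2_grid[j_min]:.2f}")
```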

Consequences of the use of E[n] instead of Eav:

A graphical analysis can expose the issue most clearly.

I created a "Training Set" with a bunch of "feature vectors" that are PURELY linearly related to the output "class". Then, using the Method of Steepest Descent from the last section with a SINGLE weight, let us visualize the process (a single weight keeps the figures simple — the Error Energy surface is a parabola instead of the "salad bowl").

You pick a random weight to start — the RED diamond (about 0.8 on the x-axis in the figure). You use the FIRST Training Set input and output to get E[1], computed by changing the weight, w, over a range (-1 to +1 here) with the SAME Training Set pair; this gives E[1] versus w, the "error energy surface" shown above. Fit a tangent, find its slope magnitude and take a step in the negative direction of the slope — this is shown as the dotted step from the RED diamond to the GREEN circle.

Taking the SECOND input-output pair from the Training Set, you plot E[2] versus w. Here the step is taken NOT from a random point but from the weight w[1] (at the GREEN dot) that we found with the previous (first) pair on the E[1] plot. Start at that weight (approximately 0.55 in the figure) BUT now on the E[2] curve. The E[1] and E[2] curves are very different (bottom left, BLUE and RED) — fortunately, the minimum point seems to be the same . . . You keep repeating these steps and you reach the minimum error.

However, that is a fiction!

The inputs and outputs in the Training set (and Test set) will contain noise and measurement errors. Also, it is highly unlikely that the true relationship between input and output is STRICTLY linear. The middle and right panels show the error energy surfaces corresponding to input-output pairs with a small amount of (1) noise and (2) non-linearity, respectively. In the NOISE case, the minimum error point is hard to tell (tentatively marked with a RED 'X'), and in the non-linear case, the minimum will "jump around" for each E[n], for all n = 1 to N, the size of the Training Set! When the non-linearity is high, the error surface can be highly warped, with crevasses and bumps and flat regions.
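A small reconstruction along the lines of the figures just described (my own sketch; the noise level and the cubic non-linearity are illustrative assumptions): compute E[n] versus w for a handful of individual Training pairs and note where each curve bottoms out.

```python
import numpy as np

rng = np.random.default_rng(2)
w_grid = np.linspace(-1, 1, 201)
x = rng.uniform(0.2, 1.0, size=5)                    # a handful of Training inputs

def per_sample_minima(d):
    # For each pair (x[n], d[n]), E[n](w) = (d[n] - w*x[n])^2; return the w where each curve bottoms out.
    return [w_grid[np.argmin((d[n] - w_grid * x[n]) ** 2)] for n in range(len(x))]

d_linear = 0.5 * x                                   # purely linear: every E[n] bottoms out at w = 0.5
d_noisy  = 0.5 * x + rng.normal(0, 0.05, size=5)     # small noise: the minima scatter around 0.5
d_nonlin = 0.5 * x + 0.2 * x ** 3                    # small non-linearity: the minima "jump around" with x

print("linear    :", np.round(per_sample_minima(d_linear), 2))
print("noisy     :", np.round(per_sample_minima(d_noisy), 2))
print("non-linear:", np.round(per_sample_minima(d_nonlin), 2))
```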

This is all known to Data Scientists; but we have to internalize this reality and design AI solutions that account for it, unlike in lab experiments and PoCs where we can brush it under the rug!

Because we use E[n] instead of Eav, the error surface we are trying to walk down moves around at each step of the Training. There has been a cottage industry developing better ways to step down the error surface (here is one reference: Intro to optimization in deep learning: Momentum, RMSProp and Adam). In all cases, the explanations are based on Eav (or Etot), but we are NOT on that surface. The unstated assumption is that somehow, when we use many E[n]'s, things get "averaged out"! In reality, instead of walking down an "average" surface, we are walking down a bunch of surfaces that jump around and are warped from feature vector to feature vector!
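As one example from that cottage industry, here is the classic momentum update in sketch form (generic textbook momentum, not tied to the linked article or any library; the coefficients are typical defaults, not recommendations):

```python
def momentum_step(w, velocity, grad, eta=0.1, beta=0.9):
    # Momentum smooths the jumpy per-sample gradients by averaging them over recent steps,
    # one practical response to the fact that each E[n] surface moves around.
    velocity = beta * velocity + eta * grad
    return w - velocity, velocity
```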

In the TEST set, when we get a new "feature vector" to classify, its OWN error surface, E[.], is in all likelihood different from the last Training Set error surface, leading to the use of non-optimum weights from the Training Set during Test and production runs!

But then, the Method of Steepest Descent seems to work! Why does that happen?! It has to do with what we can call “regularity” or “homogeneity” or “stationarity” of the underlying system. If these assumptions hold approximately, the method will not blow up. Explicit non-linearity in the model is accommodated by the use of NON-linear regression models or recasting the problem as approximately linear (using high-dimensional projection) based on Cover Theorem.

The main issues for AI solution building are the following:

· We use E[n] instead of Eav; in a step-by-step learning solution, we have no other option!

· Ever-present noise will prevent reaching the global optimum weights.

· An ever-present small amount of non-linearity will also prevent reaching the global optimum weights.

To address the first issue, we can use block methods! All the Training Set data are available to you at the outset anyway; you can use them all to minimize Eav, or some better estimate of the Expected value of [Squared Error]. Then we can use these weights for Test and production. It is true that block methods become intractable when there are millions upon millions of weights to optimize. However, there are many situations in ML where the number of weights is only in the 1000's or 10's of thousands. Why not use block methods in those cases? Computer systems with large memory capacity are not that rare anymore!
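A minimal sketch of that block route (my own illustration; the sizes, noise level and use of plain least squares via NumPy are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 200))                 # whole Training Set in hand: 5000 samples, 200 weights
true_w = rng.normal(size=200)
d = X @ true_w + rng.normal(0, 0.1, size=5000)   # outputs with a little measurement noise

# Block solution: minimize Eav over ALL the data in one shot -- no step-by-step learning
w_block, *_ = np.linalg.lstsq(X, d, rcond=None)

print("largest weight error:", np.max(np.abs(w_block - true_w)))
```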

The only case where step-by-step learning becomes ESSENTIAL is when the data are NON-stationary — the underlying system is changing from sample to sample! For that we have “Dynamical ML”.

Another avenue may be to develop a Viterbi-algorithm-inspired approximate maximum likelihood method to figure out the next state (= step) to go to, assuming some hidden models, while also taking off the blinkers and looking around to find a better long-term payback, as in Reinforcement Learning. I won't be surprised if someone has already done this — I welcome the reference!

Business Impact of Learning Inaccuracies:

In the past two sections, we have described the following pessimistic stories:

· No breakthroughs copying human learning are imminent! We can NEVER know how human beings learn since in-vivo studies in NORMAL humans are impossible to conduct.

· Current step-by-step learning methods can never be 100% accurate due to –

o Use of E[n] instead of Eav — too much step-to-step variability.

o Ever-present noise will prevent reaching the global minimum error.

o A small amount of non-linearity is always present (even in high-dimensional projection approaches), which will prevent reaching the global optimum weights.

But all is not lost! In certain ML use cases, we can have 80–90% solutions . . . In business applications, it is a matter of accounting for this imperfection, YET making use of the 80% solutions. We have to be clear-eyed about the limitations to know the extent to which we can fully exploit the business value in ML and AI.

It should be noted that 90% or higher levels of accuracy are not necessary in all business situations. Think of a recommendation engine that proffers ads to consumers — if I get a few ads that are irrelevant, nothing much is lost; the ML algorithm need not be super-accurate to still deliver some business value.

In other use cases, where many errors will doom the service, we have to commit to the need for a "human in the loop" — this is not a bug; it is a welcome feature! We also have to help the human in the loop "curate" the ML results of classification or prediction.

Guidelines for AI System Solutions:

An AI solution as a "System" is important for business value creation; point solutions will remain cool demos or PoCs and will not rise to the level where a business will rely on AI.

As described in detail in my recent article, "AI Business Transformation Playbook for Executives", the concept diagram there shows an AI *System* solution with the business objective of "Do more at higher quality with better UX". What I am advocating is a human "AI Curator" in the ERP-Analytics portion who assesses AI prescriptions.

This requires that each ML output (class membership or a prediction) comes with a Figure of Merit. If it is a statistical quantity, instead of a point estimate, a confidence interval can be reported. Considering that even professional statisticians have a hard time interpreting confidence intervals sometimes, it is better to come up with a Figure of Merit (FoM) that ranges from 0 to 1 (like a probability measure).

FoM > 0.9, no review necessary before next stage of processing.

FoM > 0.7 and < 0.9, do random spot checks.

FoM < 0.7, check every time and bring in the responsible Data Scientist for a deeper dive.
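A sketch of how this triage could look in code (the thresholds follow the rules above; the spot-check rate and the returned action labels are my own illustrative choices):

```python
import random

def route_by_fom(fom, spot_check_rate=0.1):
    """Decide what the AI Curator does with one ML output, given its Figure of Merit (0 to 1)."""
    if fom > 0.9:
        return "pass through"                            # no review before the next processing stage
    if fom > 0.7:
        # random spot checks in the 0.7-0.9 band
        return "spot check" if random.random() < spot_check_rate else "pass through"
    return "review and escalate to the Data Scientist"   # FoM below 0.7: check every time

print(route_by_fom(0.95), "|", route_by_fom(0.80), "|", route_by_fom(0.40))
```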

FoM should have its own "Operations" dashboard, separate from the Analytics dashboards. FoM and tolerances should be decided on the basis of the use case; for example, if the ML prediction is the "time to failure" of a machine part, FoM will have to be based on the Weibull probability distribution, and so on.

SUMMARY

When we take a careful look at Learning in Supervised ML, we find that errors will always persist in production use. We also do not see any radical improvement coming from a better understanding of how humans learn.

· When the dimensionality of the problem is manageable, do not hesitate to use block algorithms. Taking inspiration from the Viterbi algorithm, some improvement may also be possible in step-by-step Supervised ML.

· Accommodate the problem of errors rather than condemn ML. ML can help simplify many tasks within a job, which will lead to practical automation.

· Businesses need AI solutions implemented as an overall System. This system should insert a “human curator” who will moderate the inevitable errors that ML will make in production use.

· Figures of merit need to be developed, relevant to the use case so that a non-specialist Human Curator can perform the basic job of catching major “gotchas”.

· Automation based on AI will still need humans in the loop to perform at levels acceptable to most businesses — people jobs will change but not disappear altogether which is a GOOD thing!

The work of demos and PoCs should continue. But we need a new crop of AI Solutioning experts who have internalized the essential fallibility of ML and can still craft AI systems with humans in the loop and deploy the virtuous cycle of delivering services that can “Do more at higher quality with better UX”!

. . . . . . . . . .
Dr. PG Madhavan's career in corporate technology includes developing multiple AI and ML startups for NEC X, Inc., a subsidiary of NEC, and product leadership roles at Microsoft, Bell Labs, Rockwell Automation and GE Aviation. PG founded and was CEO at 2 startups (and CTO at 2 others), leading all aspects of startup life.

He has led the development of large-scale ML products at major corporations (GE Aviation, Rockwell Automation and NEC, as well as other software solutions at Microsoft and Lucent) and startups (Syzen Analytics, NEC startups and Global Logic), spanning ML algorithms, cloud software development and business operations in the manufacturing, telecom, retail, advertising, medical and semiconductor industries.

After obtaining his Ph.D. in Electrical and Computer Engineering from McMaster University, Canada, and Masters in Biomedical Engineering from IIT, Madras, PG pursued original research in Random Field Theory and Computational Neuroscience as a faculty member at University of Michigan, Ann Arbor, and Waterloo University, Canada, among others.

PG's recent major contribution in Data Science is the creation of "Systems Analytics", a blend of Systems Theory and Machine Learning (published in 2016) providing a pathway to formally incorporate "dynamics" into Machine Learning.

https://www.linkedin.com/in/pgmad
