What is reinforcement learning?
We can distinguish three main techniques currently used in machine learning:
- Supervised learning
- We feed the software a set of labelled examples, thereby ‘explaining’ exactly how each one should be interpreted
- Based on this, the software can make generalised observations about the data
- By extrapolation, the software can then ‘explain’ future, unseen data
- This approach is widely used, for example in image-based automatic data analysis
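To make this concrete, here is a minimal sketch of supervised learning: a 1-nearest-neighbour classifier that learns from labelled examples. The data points and labels are hypothetical illustrations, not taken from any real dataset.

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbour
# classifier. Labelled examples "explain" how inputs should be
# interpreted; new points are classified by extrapolation.

def nearest_neighbour(train, query):
    """Return the label of the training example closest to `query`."""
    best_label, best_dist = None, float("inf")
    for features, label in train:
        dist = sum((a - b) ** 2 for a, b in zip(features, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Labelled examples: (feature vector, label) -- hypothetical data
train = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog"), ((1.2, 0.8), "cat")]

print(nearest_neighbour(train, (0.9, 1.1)))  # → cat
```

The 'explanation' here is the label attached to each training point; the generalisation is the assumption that nearby points share a label.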
- Unsupervised learning
- We input the data to the software without any explanation
- This can be used to identify structures in complex data, for example to detect unusual events such as credit card fraud
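A minimal sketch of the unsupervised case: flagging unusual events as points that lie far from the bulk of the data, with no labels provided. The transaction amounts and the 2-standard-deviation threshold are hypothetical choices for illustration.

```python
# A minimal sketch of unsupervised anomaly detection: no labels,
# just a rule that flags values far from the rest of the data
# (e.g. a suspicious transaction among ordinary ones).

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Hypothetical transaction amounts, one of them anomalous
amounts = [12.5, 9.9, 11.2, 10.4, 950.0, 10.8, 9.5]
print(find_outliers(amounts))  # → [950.0]
```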
- Reinforcement learning (RL)
- Three basic components of RL are:
- Action : a step taken by the software, resulting in a reward or penalty depending on its effect on the environment
- Policy : the rules according to which actions are chosen and evaluated
- Environment : the external world affected by the action, described by the state of the system
The objective of RL is to optimise the interaction with the environment in order to achieve a desirable outcome. In mathematical terms, the software tries to maximise the reward, calculated from the action taken and the result it triggered in the environment under a given policy. For example:
When we play chess:
- we take an action (our next move)
- which will change the environment (the game board)
- this, under a given policy (the rules of the game), will give us a result (in this case a win or a loss).

As in many iterative games, the result given at the end of the game is sometimes the only one that matters. In such cases it is hard to estimate the reward from each single step: there are many ways to win or lose, and the final result is not apparent from the moves made in the middle of the game. This is what actually makes games exciting, when in-game twists occur. The whole art is to understand the interplay between the actions and their outcomes over the course of the entire game.
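The interaction loop above can be sketched in a few lines. Chess itself is too large for a short example, so a toy number-reaching "game" stands in for the board here; the target, step limit, and rewards are hypothetical placeholders. The key feature it preserves is that the reward arrives only at the end of the episode, not after each move.

```python
# A minimal sketch of the RL interaction loop: the agent acts,
# each action changes the state of the environment, and a single
# reward is revealed only when the episode ends (win or lose),
# mirroring the delayed result of a full game of chess.

def play_episode(policy, target=7, max_steps=10):
    """Run one episode and return the final (and only) reward."""
    state = 0
    for _ in range(max_steps):
        action = policy(state)      # take an action (our next "move")
        state = state + action      # the action changes the environment
        if state == target:         # terminal state under the "rules"
            return 1                # win: reward only at the end
    return -1                       # lose: no intermediate rewards

greedy = lambda state: 1            # a trivial policy: always step by 1
print(play_episode(greedy))         # → 1
```

Note that no single step is rewarded: estimating how much each intermediate move contributed to the final result is exactly the credit-assignment problem described above.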
Something similar happens in a biological process, which is, however, much more complicated. In this case the interplay between:
- our actions: adjusting the process conditions
- the environment they affect: our process
- the policy they are subject to: the laws of biology
- will determine the result: e.g. the yield or quality of a fermentation process.
Recent progress in machine learning tools, with the introduction of deep learning, allows better modelling of the interplay between actions and their results under a given policy. The policy decides how we evaluate the actions taken and how we choose the next ones. Biological systems are extremely complex, so modelling this interaction under the laws of biology is becoming possible only with the newest approximation methods, such as deep reinforcement learning.
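As a bridge between the chess example and process control, here is a sketch of tabular Q-learning (a classical RL method; deep RL replaces the table with a neural network) on a hypothetical control problem: the state is a coarse temperature band, the actions are cool, hold, or heat, and the reward favours staying in a target band. All states, rewards, and hyperparameters are illustrative, not real fermentation parameters.

```python
# A minimal sketch of tabular Q-learning on a hypothetical
# process-control task: keep a coarse "temperature band" (0..4)
# in the target band by cooling, holding, or heating each step.

import random

ACTIONS = [-1, 0, +1]    # cool, hold, heat (one band per step)
TARGET = 2               # the band with the best (hypothetical) yield

def step(state, action):
    """Environment dynamics: move between bands 0..4, reward near TARGET."""
    next_state = min(4, max(0, state + action))
    reward = 1.0 if next_state == TARGET else -0.1
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    random.seed(0)
    q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    for _ in range(episodes):
        state = random.randrange(5)
        for _ in range(20):
            # epsilon-greedy: mostly exploit the current estimates,
            # occasionally explore a random action
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate towards the observed
            # reward plus the discounted value of the best next action
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(5)}
print(policy)  # the learned action in each temperature band
```

The learned policy steers every band towards the target: heat when below it, cool when above it, hold when in it. A real bioprocess would need a far richer state (cell condition, chemical concentrations) and a function approximator instead of a table, which is where deep RL comes in.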
Additional pieces of this puzzle are advancements in sensor design, which provide more accurate data on different aspects of the process under investigation, such as temperature, the condition of the living cells and the concentrations of chemical components.
We believe that putting these puzzle pieces together will improve control over the large-scale production of beer, by providing recommendations on how to adjust the parameters of this complex and fascinating process, used by humans since the beginning of civilisation, with a current world production of 1.94 billion hectolitres per year.