1. The EE problem in recommender systems
Exploration vs. Exploitation (EE) is a common problem in computational advertising and recommender systems. Why does the EE problem exist? Simply put, because a recommender system must balance accuracy against diversity.
In the EE problem, exploitation means that if a user has a fairly clear interest, you should of course cater to it, just as you can spend the money you have already earned. Exploration, on the other hand, means that if you only recommend what users already know they like, they will quickly get bored, so you need to keep probing for new interests on their behalf. This is like having some money to spend but still working to earn more; otherwise you will starve once the money runs out.
2. Bandit Algorithm
The Bandit algorithm is an effective way to tackle the EE problem. Let's first look at where it comes from. Bandit algorithms originate from gambling, and the problem they aim to solve is as follows:
A gambler goes to a casino to try out a slot machine. He sees a row of slot machines that all look identical, but each machine has a different probability of paying out money. He doesn't know the probability distribution of each machine. So, which machine should he choose each time to maximize his winnings? This is the Multi-armed Bandit Problem (MAB).
How do we solve this problem? The best way is to try things out, not blindly, but strategically and quickly. These strategies are the Bandit algorithm.
How does the Bandit algorithm relate to the EE problem in recommender systems? Suppose we have run some experiments and estimated the payout probability of each slot machine. To maximize our winnings, we would keep playing the machine with the highest estimated payout probability—this is exploitation. However, our estimates are not the true payout probabilities; there might be machines with even higher true payout probabilities. Therefore, further trials are needed—this is the exploration problem.
Next, let's take a look at some classic Bandit algorithm implementations, but we still need to supplement some basic knowledge.
3. Basic Knowledge
3.1 Cumulative Regret
The Bandit algorithm needs to quantify a core question: how much regret is there in making a wrong choice? Can we reduce the regret? Therefore, we have a metric for evaluating the Bandit algorithm: cumulative regret.
R_T = Σ_{t=1}^{T} (r_t* − r_{B(t)})

Here, t is the round number and r is the reward. The first term on the right-hand side, r_t*, is the expected reward of the best arm in round t, while the second term, r_{B(t)}, is the reward obtained from the arm actually selected in round t. Summing these differences over all rounds gives the cumulative regret R_T.
For the same problem, if we conduct the same number of experiments using different bandit algorithms, then the algorithm that produces the slowest increase in total regret is the one that performs best.
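To make the metric concrete, here is a minimal sketch that computes cumulative regret for a fixed, made-up sequence of arm choices (the payout probabilities and the choice sequence are assumptions for illustration only):

```python
import numpy as np

true_rewards = np.array([0.2, 0.5, 0.8])  # assumed true expected payouts per arm
best = true_rewards.max()                  # expected reward of the optimal arm

chosen = [0, 1, 2, 2, 2]                   # arms pulled in rounds 1..5
regret = np.cumsum([best - true_rewards[arm] for arm in chosen])
print(regret)  # regret grows only while suboptimal arms are chosen
```

Once the optimal arm is consistently chosen, each round adds zero regret, so a flat regret curve is the signature of a good Bandit algorithm.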
3.2 Beta Distribution
For more information on the Beta distribution, please refer to this post: https://www.zhihu.com/question/30269898. Here we only give a brief introduction. The Beta distribution can be viewed as a probability distribution over probabilities: it describes the distribution of the success probability p of a binomial distribution. Its density has the form

Beta(p; a, b) = p^(a−1) * (1 − p)^(b−1) / B(a, b)

where B(a, b) is a normalizing constant. Here, a and b can be read (roughly) as the number of successes and failures in a + b Bernoulli trials. The following diagram illustrates the meaning of the Beta distribution:
The graph above shows three curves. Ignoring the middle one: for the first curve, a=81 and b=219. This means that in 300 Bernoulli trials with 81 successes and 219 failures, the distribution of the success probability p peaks around 0.27. Note that we cannot say the success probability is exactly 0.27; this is the difference between the frequentist and Bayesian views. Now suppose we run another 300 trials, for a total of 600 Bernoulli trials with 181 successes and 419 failures. The distribution of p now resembles the blue curve, peaking around 0.3 and becoming narrower as more evidence accumulates.
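The two cases above can be checked numerically by sampling from the corresponding Beta distributions (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# 81 successes / 219 failures (first curve) vs. 181 successes / 419 failures (blue curve)
samples_300 = rng.beta(81, 219, size=100_000)
samples_600 = rng.beta(181, 419, size=100_000)

print(samples_300.mean())  # close to 81 / 300 = 0.27
print(samples_600.mean())  # close to 181 / 600 ≈ 0.30
print(samples_300.std(), samples_600.std())  # the 600-trial distribution is narrower
```

More trials concentrate the distribution: the uncertainty about p shrinks even though its peak barely moves.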
4. The principle and implementation of the classic Bandit algorithm
The payout mentioned below can be understood as the observed probability of a slot machine paying out money.
4.1 Naive Bandit Algorithm
First, try several times randomly, calculate the average return of each arm, and keep selecting the arm with the largest average return.
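A minimal sketch of this naive strategy, with made-up payout probabilities and an arbitrary number of warm-up trials:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.1, 0.5, 0.9])  # assumed payout probabilities (unknown to the gambler)

# Phase 1: pull each arm 100 times and record its average observed reward
estimates = np.array([rng.binomial(1, p, size=100).mean() for p in true_p])

# Phase 2: from now on, always pull the arm with the largest observed average
best_arm = int(np.argmax(estimates))
print(best_arm)
```

The weakness is obvious: after the warm-up phase the algorithm never explores again, so an unlucky initial estimate can lock it onto a suboptimal arm forever.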
4.2 Epsilon-Greedy Algorithm
Choose a small number epsilon in (0, 1). On each round, with probability epsilon select one arm uniformly at random from all arms; with probability 1−epsilon select the arm with the highest average reward so far. Then update that arm's expected reward based on the observed payout.
Here, the value of epsilon controls the balance between exploit and explore: each decision explores with probability ε and exploits with probability 1−ε, and the expected reward of the selected item is updated from the reward it returns.
The Epsilon-Greedy algorithm can adapt to change: if an item's reward shifts, the strategy adjusts promptly instead of getting stuck in a suboptimal state. The epsilon value controls the preference for exploit versus explore; a value closer to 0 means a more conservative strategy that mostly exploits, prioritizing spending over earning. However, after the strategy has run for a while and has built up real knowledge about each item, it still explores indiscriminately at random instead of using that knowledge—this is the weakness of the Epsilon-Greedy algorithm.
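A minimal sketch of Epsilon-Greedy on two arms (the payout probabilities, epsilon, and round count are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.1, 0.9])  # assumed payout probabilities
eps = 0.1                      # exploration probability
counts = np.zeros(2)           # pulls per arm
values = np.zeros(2)           # running average reward per arm

for t in range(2000):
    if rng.random() < eps:
        arm = int(rng.integers(2))    # explore: pick a random arm
    else:
        arm = int(np.argmax(values))  # exploit: pick the best arm so far
    reward = rng.binomial(1, true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental average
```

After enough rounds the better arm dominates the pull counts, while the ε fraction of random pulls keeps the estimates of all arms from going stale.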
4.3 Thompson Sampling Algorithm
The Thompson Sampling algorithm uses the Beta distribution. It assumes each slot machine pays out with some probability p, and that p follows a beta(wins, lose) distribution. Each arm maintains its own Beta parameters, wins and lose. After each trial, the selected arm is pulled; if it pays out, that arm's wins is incremented by 1, otherwise its lose is incremented by 1.
The arm is selected as follows: draw a random number b from each arm's current Beta distribution, and pull the arm whose draw is the largest.
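The selection step can be sketched as follows (the win/loss counts are made-up; adding 1 to each parameter is the usual uniform prior):

```python
import numpy as np

rng = np.random.default_rng(0)
wins  = np.array([10, 60])  # assumed success counts per arm
loses = np.array([90, 40])  # assumed failure counts per arm

# One posterior draw per arm; pull the arm with the largest draw
samples = rng.beta(wins + 1, loses + 1)
item = int(np.argmax(samples))
print(item)
```

An arm with little data has a wide Beta distribution, so it occasionally produces the largest draw and gets explored; a well-established good arm usually wins the draw and gets exploited.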
4.4 UCB Algorithm
As mentioned earlier, the Epsilon-Greedy algorithm selects all slot machines with the same probability during exploration, which does not make full use of historical information, such as the number of times each slot machine has been explored before and the frequency of payouts during previous explorations.
So how can we make full use of historical information? First, based on the number of times the slot machine has been explored and the number of times it has paid out, we can calculate the observed probability p' of each slot machine paying out. At the same time, since the number of observations is limited, there will always be a difference ∆ between the observed probability and the true probability p, i.e., p'-∆<=p<=p'+∆.
Based on the above discussion, we arrive at another commonly used Bandit algorithm: the UCB (Upper Confidence Bound) algorithm. This algorithm optimistically assumes that each slot machine can yield a payoff of p' + ∆ during each recommendation.
Okay, the next question is how to calculate the difference ∆ between the observed probability and the true probability. We have two intuitive understandings: 1) For a selected slot machine, each additional feedback will decrease ∆. When there is an infinite number of feedbacks, ∆ approaches 0, and will eventually be less than the ∆ of other unselected slot machines. 2) For an unselected slot machine, ∆ will increase with the number of rounds, and will eventually be greater than the ∆ of other selected slot machines.
Therefore, after a certain number of rounds, every slot machine gets a chance to be explored. In the UCB algorithm, the score p' + ∆ for slot machine j is computed as:

p'_j + sqrt(2 * ln(T) / n_j)

The term before the plus sign is the average reward of the j-th slot machine so far, and the term after it is called the bonus, which is essentially the standard deviation of that average. T is the current total number of trials, and n_j is the number of times slot machine j has been tried.
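As a quick numeric illustration of the intuition above (T and the per-arm trial counts are made-up values):

```python
import numpy as np

# The bonus sqrt(2 ln T / n) shrinks as an arm accumulates trials
T = 1000
bonuses = {n: np.sqrt(2 * np.log(T) / n) for n in (1, 10, 100, 1000)}
print(bonuses)
```

A rarely tried arm carries a large bonus and so still gets picked occasionally, while a heavily tried arm's score converges to its plain average reward.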
Why choose this form of ∆? It comes from the Chernoff-Hoeffding bound: for n i.i.d. samples in [0, 1] with empirical mean p' and true mean p, P(|p' − p| > ∆) <= 2 * e^(−2 * n * ∆^2). Plugging in ∆ = sqrt(2 * ln(T) / n) makes this probability at most 2 / T^4, so the true payout probability almost certainly lies within [p' − ∆, p' + ∆] (the detailed derivation can be found in the Zhihu post https://zhuanlan.zhihu.com/p/32356077).
5. Code Implementation
Next, we will implement two basic Bandit algorithms: UCB and Thompson Sampling.
5.1 UCB Algorithm
The code has detailed comments, so I'll just paste the complete code here:
import numpy as np

T = 1000  # T rounds of trials
N = 10    # N slot machines

true_rewards = np.random.uniform(low=0, high=1, size=N)  # the actual payout probability of each slot machine
estimated_rewards = np.zeros(N)  # the observed payout probability of each slot machine, initially 0
chosen_count = np.zeros(N)       # how many times each slot machine has been explored, initially 0
total_reward = 0

# Calculate delta
def calculate_delta(T, item):
    if chosen_count[item] == 0:
        return 1
    else:
        return np.sqrt(2 * np.log(T) / chosen_count[item])

# Calculate p' + delta for each slot machine and make a choice
def UCB(t, N):
    upper_bound_probs = [estimated_rewards[item] + calculate_delta(t, item) for item in range(N)]
    item = np.argmax(upper_bound_probs)
    reward = np.random.binomial(n=1, p=true_rewards[item])
    return item, reward

for t in range(1, T):  # perform T trials in sequence
    # Select a slot machine and observe whether it pays out
    item, reward = UCB(t, N)
    total_reward += reward  # how many customers accepted the recommendation
    # Update the observed payout probability of the chosen slot machine
    estimated_rewards[item] = ((t - 1) * estimated_rewards[item] + reward) / t
    chosen_count[item] += 1
5.2 Thompson Sampling Algorithm
The Thompson Sampling algorithm involves a Beta distribution, so we use the pymc library to generate Beta-distributed random numbers. Only one line of code is needed to select the appropriate slot machine:
np.argmax(pymc.rbeta(1+successes,1+totals-successes))
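The one-liner above relies on pymc's rbeta; as a sketch, the same loop can be written with NumPy's equivalent beta sampler (the simulation setup mirrors the UCB code above; the variable names and round counts are otherwise assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000  # rounds of trials
N = 10    # slot machines
true_rewards = rng.uniform(0, 1, size=N)  # true payout probabilities
wins = np.zeros(N)   # observed successes per arm
loses = np.zeros(N)  # observed failures per arm
total_reward = 0

for t in range(T):
    samples = rng.beta(wins + 1, loses + 1)  # one draw from each arm's Beta posterior
    item = int(np.argmax(samples))           # pull the arm with the largest draw
    reward = rng.binomial(1, true_rewards[item])
    wins[item] += reward
    loses[item] += 1 - reward
    total_reward += reward
```

Unlike UCB, no explicit bonus term is needed: the width of each arm's posterior automatically supplies the exploration pressure.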