The following knowledge might be a good starting point. It doesn't require much prior knowledge; high school math is sufficient. Just spend 15-20 minutes patiently thinking along with it, and it will surely help deepen your understanding of AI. So, let's begin:
The main value of machine learning lies in learning from experience (E) and then using that experience to perform a task (T), with the goal of optimizing performance (P) on task T. For example, in a bank, AI learns the relationship between customer behavior and creditworthiness from data; this relationship is the experience (E). It then calculates a more accurate credit card limit for each customer in real time; this is the task (T). The goal (P) is to increase the bank's credit card revenue within a given risk tolerance, because the previous one-size-fits-all approach to limit adjustments is far less efficient than the precise limits AI assigns. Now that we understand what AI does, how does it do it?
Basic process of machine learning
The specific process is shown in the diagram below. Please understand what these seven steps do, and what their logic and sequence are. Subsequent explanations will focus on one or two of these steps. We often confuse them, for example by conflating AI training with AI task execution.
1. Choose an algorithm. For example, in the bank example above, we first need to classify customers to find a blacklist; this is a classification algorithm. We also need to predict their credit limit level based on customer behavior; this is a regression algorithm. The Transformer discussed in the previous article is a more recent, more advanced algorithm.
2. Prepare high-quality data and perform feature engineering. This usually takes a great deal of time, especially in industry, where high-quality data is often unavailable. Data quality has four aspects. First, the absolute amount of data, which is easy to understand. Second, sample coverage. For example, in banking risk control, transaction data alone is not enough; real fraud samples are essential. In equipment management, fault data is needed in addition to operational data for the machine to truly learn. Third, data processing efficiency. Real-time data is often the most valuable for AI and is key to maximizing the value of AI decision-making. Fourth, feature engineering. Raw data alone is insufficient; it must be processed to extract valuable features the machine can understand. The simplest example is encoding "male" and "female" as 0 and 1 respectively. Feature engineering is the most important area of AI computation; an entire deep learning neural network can be understood as performing feature engineering on data.
3. Train the data using algorithms. This training process is the key step that enables the machine to perform tasks, and the logic of many computational algorithms is designed for this part. When we say that computing power is a decisive factor, we are often referring to the computing power bottleneck in the training phase.
4. After training, many tests are usually conducted to ensure that the experience can cope with a variety of situations, thus truly forming experience E.
5. Experience E needs to be used in the production system to execute task T in real time. For example, in the example above, our core task is to dynamically adjust the credit limit for each user.
6. Evaluate performance. How much does AI improve performance compared to humans, and are there any deviations from the plan?
7. Continuous optimization. Optimization here is comprehensive, encompassing algorithm updates, improved data quality, better-fitting training results, and more real-time task execution. We often hear clients worry that an AI rollout might not be effective. In reality, AI deployments rarely yield good initial results; continuous optimization is essential before they truly pay off. Many of Fourth Paradigm's AI-driven clients consider this ability to iterate continuously a core competency.
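As a concrete illustration of steps 1 through 5, here is a minimal Python sketch. Everything in it is synthetic: the data, the "true" relationship between delinquency rate and credit limit, and the production input are all invented for demonstration, and steps 6 and 7 are only noted in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: prepare data (a synthetic stand-in for real bank data).
# x = delinquency rate, y = credit limit; the relationship is made up here.
x = rng.uniform(0.0, 0.4, size=200)
y = 110_000 - 180_000 * x + rng.normal(0, 5_000, size=200)

# Hold out part of the data for Step 4 (testing).
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Steps 1 & 3: choose an algorithm (linear regression) and train it.
w, b = np.polyfit(x_train, y_train, deg=1)

# Step 4: test, i.e. measure average error on data the model never saw.
test_error = np.mean(np.abs((w * x_test + b) - y_test))

# Step 5: execute task T, predicting a limit for a new customer in "production".
new_limit = w * 0.10 + b

# Steps 6 & 7 (evaluation against business metrics, continuous retraining)
# happen outside this script, on live data.
print(round(w), round(b), round(test_error))
```

Note the train/test split: testing on held-out data is what step 4 means by "coping with a variety of situations".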
The theoretical foundation of machine learning
Let's start by discussing how this experience E is trained. Suppose that this experience E can be abstracted into a linear relationship (of course, the real world isn't always a simple linear relationship; this is just a simplification), that is, let Y = f(x) = wx + b, where x is bank data (e.g., repayment delinquency rate), and Y is the credit limit. The experience E we want to learn is this f(x). This concept is very important, and I suggest you pause and take a moment to understand it. To obtain f(x), we need to first train it using some samples (that is, we know some x data and the corresponding y beforehand). For example:
| Credit limit (y) | Loan delinquency rate (x) |
| --- | --- |
| 100,000 | 0% |
| 20,000 | 5% |
| 120,000 | 15% |
| 100,000 | 12.4% |
| 80,000 | 35% |
We hope to use these (x, y) samples to deduce the linear function f().
As shown in Figure 1: In the space of the overdue repayment rate x and the credit card limit y, we have many samples, and what we need to predict is what the straight line y=wx+b will look like.
We can set a predetermined step size for the machine and exhaustively enumerate many possibilities for y = wx + b. But which one should we choose?
Here we need to introduce: the cost function J
A simple definition of the cost function: the average of the distances from the line we predict to each sample.

Suppose we choose a line, that is, an f(). For sample i, X(i) is the input and y(i) is the value observed in the sample, while f(X(i)) is the value the line predicts. With M samples, the cost function (using squared distance, so that deviations in either direction count equally) is:

J(w, b) = (1/M) × Σᵢ ( f(X(i)) - y(i) )²

This function is the average of the distances from the predicted line to the samples. In other words, we should find the line whose average distance to the samples is smallest; that line has the lowest cost.
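The idea of enumerating candidate lines and picking the one with the lowest cost can be sketched in a few lines of Python. The five samples come from the table above; the squared-distance cost and the particular candidate grid are illustrative assumptions, not part of the original article.

```python
# The five (delinquency rate, credit limit) samples from the table above.
samples_x = [0.00, 0.05, 0.15, 0.124, 0.35]
samples_y = [100_000, 20_000, 120_000, 100_000, 80_000]

def cost(w, b):
    """Average squared distance from the line y = w*x + b to the samples."""
    m = len(samples_x)
    return sum((w * x + b - y) ** 2 for x, y in zip(samples_x, samples_y)) / m

# Enumerate candidate lines with a fixed step, as described above,
# and keep the one with the smallest cost. The grid bounds and step
# size here are arbitrary choices for illustration.
candidates = [(w, b) for w in range(-200_000, 200_001, 10_000)
                     for b in range(0, 150_001, 10_000)]
best_w, best_b = min(candidates, key=lambda p: cost(p[0], p[1]))
print(best_w, best_b, cost(best_w, best_b))
```

Brute-force enumeration works for one slope and one intercept, but it becomes hopeless as the number of parameters grows, which is why the next section introduces gradient descent.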
This might be a bit confusing for some of you, so let's break it down:
A few concepts to keep straight: real-world events are recorded as data, and from that data we take samples; these three levels (real world, data, samples) narrow down progressively. From the samples we then train a y=wx+b, the goal being to use sample data to learn a pattern that can predict the real world. In short: first, generate many candidate lines y=wx+b; second, compare which candidate is better; third, select the best one.
In practice, the computer calculates various possible versions of f() and the corresponding cost function. This cost function, as shown in Figure 3, forms a curved, bowl-shaped surface. The machine follows rules to descend to the lowest point of that valley, the point where the cost function takes its smallest value. The f() corresponding to this point is the one we are looking for. This method is called gradient descent.
Figure 3
Therefore, the process by which a computer learns experience E is to infer what f() looks like overall, given a subset of x and y data (the samples). The method is gradient descent, which finds the f() that minimizes the cost function.
Gradient descent works with a preset learning rate. The learning rate determines how big a step the computer takes when generating the next candidate equation f() to compare. If f() is a linear equation, each step changes the slope by a small amount: for example, if the machine's current guess is y=3x and the step size is 0.1, the next guess would be y=3.1x. Machine learning computes the cost function for each candidate equation and follows the gradient downhill until it reaches the equation that minimizes the cost.
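This loop can be sketched in plain Python, assuming the standard squared-error cost J(w, b) = (1/m) × Σ(w·xᵢ + b - yᵢ)². The data, learning rate, and iteration count below are invented for illustration.

```python
# Toy data generated from the true line y = x + 3 (no noise, for clarity).
xs = [0.0, 0.1, 0.2, 0.3, 0.4]
ys = [3.0, 3.1, 3.2, 3.3, 3.4]

w, b = 0.0, 0.0          # initial guess for the line
learning_rate = 0.1      # step size for each downhill move
m = len(xs)

for step in range(20_000):
    # Gradients of the squared-error cost J with respect to w and b.
    grad_w = (2 / m) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / m) * sum((w * x + b - y) for x, y in zip(xs, ys))
    # Move downhill; the learning rate controls the step size.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # converges to the true line: 1.0 3.0
```

Note that the step direction comes from the gradient of the cost, so the machine does not need to enumerate every possible line; it walks downhill from wherever it starts.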
After calculating using gradient descent, the optimal f() is selected, and the computer can then complete task T. In this example, if new overdue payment rate data is available, the machine can predict the corresponding credit card limit based on f(), thus providing a more reasonable credit limit for bank customers. This improves the efficiency of bank customer service and leads to increased revenue related to credit cards.
The mathematical foundation of machine learning—vectors
In the previous example, X was the delinquency rate, meaning there was only one variable. However, in real life, credit limits are influenced by more than just one variable. There are also many other characteristics such as gender, age, region, annual salary, savings amount, credit card transaction amount, and number of defaults, as shown in the table below (Figure 4).
Figure 4 shows two samples, each a set of data representing a bank customer's basic profile and transaction behavior. A bank customer's credit card limit is determined by this information in combination. Each row of data here is called a vector; it can be written as the ordered list of that customer's feature values, e.g. X = (delinquency rate, gender, age, region, annual salary, savings amount, transaction amount, number of defaults).
Each vector contains the various feature values. A vector is a point in a high-dimensional space; in this example the space has eight dimensions, one for each feature. Compared with y=f(x) in two dimensions, y=f(x) must now be computed in this high-dimensional space, where x is a whole vector rather than a single number. This shift from a scalar x to a vector x is crucial and must be understood. The entire mathematical foundation of AI is built on this smallest unit of computation, the vector. Why view the data in this higher dimension? The background is that we want our formulas to stay linear equations rather than complicated nonlinear ones, and only when data is placed in a sufficiently high dimension do linear relationships emerge more easily.
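To make the scalar-to-vector shift concrete, here is a small sketch computing y = w · x + b for one customer. The 8-feature vector and the weights are entirely invented for illustration; only the shape of the computation matters.

```python
import numpy as np

# A hypothetical 8-feature customer vector: (delinquency rate, gender,
# age, region code, annual salary, savings, transaction amount, defaults).
x = np.array([0.05, 1, 35, 2, 200_000, 50_000, 30_000, 0], dtype=float)

# In high dimensions the "line" y = w*x + b becomes y = w . x + b:
# one weight per feature, combined by a dot product. These weights
# are placeholders; in practice they are learned from data.
w = np.array([-150_000, 500, 200, 100, 0.1, 0.2, 0.3, -20_000])
b = 20_000

y = float(np.dot(w, x) + b)
print(y)  # 59200.0
```

The formula is still linear in every feature, which is exactly the point made above: by adding dimensions we keep the convenient linear form.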
Advanced machine learning—neural networks
Before we begin, it is worth laying out the basic notation of neural networks.
We define the sample data as X(1), X(2), X(3), X(4), …, X(i). Each sample is represented as a vector, and there are i samples in total. For example, X(1) might be the vector of one customer's eight feature values (delinquency rate, gender, age, region, annual salary, savings amount, transaction amount, number of defaults); that is, each sample has 8 features.
If we express y=f(x) in this vector form, it looks like the diagram below.
That is, the feature values of each vector X(i) enter the calculation, and the cost function is then minimized by gradient descent to obtain f().
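The same gradient-descent loop carries over essentially unchanged when each X(i) is an 8-feature vector; the products simply become dot products. A sketch with synthetic data follows, where the "true" weights are made up so the result can be checked; none of these numbers come from the original article.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 8                                # m samples, 8 features each
X = rng.uniform(0, 1, size=(m, n))
true_w = np.array([3.0, -2.0, 0.5, 1.0, 0.0, 4.0, -1.0, 2.0])
true_b = 0.7
y = X @ true_w + true_b                      # noise-free targets, for clarity

w = np.zeros(n)                              # one weight per feature
b = 0.0
lr = 0.1
for _ in range(20_000):
    err = X @ w + b - y                      # residual for every sample at once
    w -= lr * (2 / m) * (X.T @ err)          # gradient w.r.t. each weight
    b -= lr * (2 / m) * err.sum()            # gradient w.r.t. the intercept

print(np.round(w, 3), round(b, 3))           # recovers true_w and true_b
```

The loop body is identical in spirit to the one-variable version; vectors just let one line of code update all eight weights at once.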
So, a neural network adds several hidden layers to the computational foundation shown in Figure 5. Figure 6 shows a neural network with three hidden layers. The main function of a neural network is to further extract new features, especially those that are hidden or non-linear.
Let's take an example, still the credit limit problem (see Figure 7). Imagine we design a hidden layer with four nodes: earning ability, repayment ability, loyalty to the bank, and credit habits. These four features are not fields provided in the original data sample. Each node is computed from the eight familiar input features, capturing its correlation with them, and together they yield a more accurate credit limit.
Earning ability, repayment ability, loyalty to the bank, and credit habits are artificially set by us for ease of understanding. In reality, machines can automatically calculate the corresponding possible hidden layers.
In many neural networks, the number of hidden layers can reach dozens. In a sense, a neural network supplements the original features of the data: it finds the non-linear correlations hidden in the data and computes them as new features to improve the model's capability. This is what we call deep learning. Deep learning is a branch of machine learning and has been widely applied across industries.
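A toy forward pass shows how a hidden layer turns 8 raw features into 4 new ones. The weights here are random placeholders and ReLU is one common choice of non-linearity; in a trained network these weights would be learned by gradient descent, and the hidden nodes would come to represent combinations like "earning ability" or "repayment ability".

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.uniform(0, 1, size=8)        # one customer's 8 normalized features

W1 = rng.normal(size=(4, 8))         # each hidden node weights all 8 inputs
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))         # the output weights the 4 hidden features
b2 = np.zeros(1)

hidden = np.maximum(0.0, W1 @ x + b1)   # the non-linearity (ReLU) is what
                                        # lets the network build non-linear
                                        # features out of the raw inputs
output = W2 @ hidden + b2               # predicted credit limit (toy scale)

print(hidden.shape, output.shape)
```

Without the non-linearity, stacking layers would collapse back into a single linear function, so it would add nothing beyond the plain y = w · x + b model above.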
Brief summary
At this point, let's briefly summarize the most basic concepts of machine learning:
① In the basic process of machine learning, we need to understand what training and execution each are, and that "continuous optimization" is of paramount importance; Fourth Paradigm has accumulated years of experience in this area, constantly learning from its mistakes. Companies will increasingly treat capabilities implemented through intelligent technologies as core competitive indicators, but the real barrier to entry lies in this "continuous optimization."
② The "training" part of machine learning is finding f(), that is, finding the optimal f() for the cost function using gradient descent. Understanding how to find f() is essential before discussing various algorithms.
③ Emphasize the concept of vectors. The mathematical background is that we don't want to wrestle with multivariate or nonlinear equations directly. We often put the data into a high-dimensional space, where linear relationships can more readily be found, although this greatly increases the computational burden. This new world of high-dimensional space is something we must get used to, even though it is somewhat counterintuitive and has no direct physical analogue.
④ Given samples and vector features, neural networks further supplement and enhance the data's features. Neural networks, or deep learning, are the direction of future development; we may not yet fully grasp their significance, but we will hear about them more and more often. Of course, explaining neural networks solely from the perspective of feature enhancement is far from sufficient, but it is at least a good starting point.
This is a rather bold way of introducing AI, quite different from classic textbooks. I want to reiterate that I am a complete novice, and this is just my experience after learning a little bit. There will be many errors and it is certainly not comprehensive. Perhaps in six months, I will have different insights and experiences, and I will supplement them for everyone then.