
A Review of Deep Learning Methods for Text Sentiment Analysis

2026-04-06 07:14:16 · #1

Text sentiment analysis aims to mine and analyze the opinions and emotions embedded in text, thereby improving performance in applications such as personalized services, recommendation systems, public opinion monitoring, and product research. From a machine learning perspective, text sentiment analysis can generally be transformed into a classification problem. The key to this lies in text representation, feature extraction, and classifier model building, with the construction of a sentiment feature dictionary being the most crucial aspect of traditional methods. In recent years, deep learning methods have made remarkable progress in many fields such as image and speech recognition. Compared to traditional machine learning methods, the biggest advantage of deep learning is its ability to automatically learn rich and effective features from large amounts of data samples, thus achieving better results. Existing research has shown that at the text representation level, word vector representation methods can acquire semantic, syntactic, and structural information of text, providing a solid foundation for sentiment analysis research and becoming a current research hotspot in this field. This paper first introduces the concept and problem classification of text sentiment analysis, reviews the relevant work of deep learning in text sentiment analysis, discusses in detail the text representation methods and deep learning models in text sentiment analysis, introduces the current problems of deep learning in text sentiment analysis applications, and looks forward to future research directions and trends in this field.

3.3 Character Level

Character stream-based text representation methods use the characters in a document as the initial input to the model, capturing the structural information of each word. This representation is largely language-insensitive: most texts are written in alphabetic characters, and even non-alphabetic texts (such as Chinese) have corresponding character-level representations (like Pinyin). Once trained, a model can therefore handle texts from many languages. On the other hand, character-level representations struggle to capture document-level structural and semantic information, and processing at the character level incurs higher computational cost.

To obtain higher-level document structure and semantic information, character-level representations are usually combined with deep network models, exploiting the powerful feature representation and layer-by-layer feature learning of deep networks. Zhang X et al. [103], inspired by the Braille encoding method, encoded documents at the character level and used deep convolutional neural networks for topic classification and sentiment analysis, building a text processing model that applies to multiple languages with high accuracy. Dos-Santos CN et al. [18] drew on the idea of learning continuous word vectors: each character in the character table is represented as a continuous vector and fed into a convolutional neural network to obtain the character-structure information of a word. This structural representation is appended to the original word vector so that the word vector carries richer information.
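Character-level input of the kind used by Zhang X et al. [103] can be sketched as a one-hot character matrix. The alphabet and maximum length below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of character-level document encoding: each column is the
# one-hot vector of one character position. ALPHABET and MAX_LEN are toy
# assumptions for illustration only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;!?'"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 16  # documents are truncated or zero-padded to a fixed length

def encode_chars(text, alphabet_size=len(ALPHABET), max_len=MAX_LEN):
    """Encode a document as an (alphabet_size, max_len) one-hot matrix."""
    m = np.zeros((alphabet_size, max_len), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay all-zero
            m[idx, pos] = 1.0
    return m

x = encode_chars("Good film!")
```

The resulting matrix is what a deep CNN would then consume; unknown characters simply become all-zero columns.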

4. Deep Learning-Based Text Sentiment Analysis Methods

4.1 Feedforward Neural Networks (FNNs) Structure and Their Application in Text Sentiment Analysis

FNNs generally refer to networks where each neuron starts from the input layer, receives input from the previous layer, and inputs it to the next layer until the output layer. There is no feedback in the entire network, which can be represented by an acyclic graph. Multi-Layer Perceptron (MLP) is a typical FNN, but MLP is a shallow network. In this paper, we mainly discuss deep feedforward neural networks based on Restricted Boltzmann Machine (RBM) [21][37][78]: Deep Boltzmann Machine (DBM) and Deep Belief Network (DBN).

4.1.1 Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM) is a two-layer undirected graph model. The first layer is the visible layer, containing several units, and is also the network's input layer. The second layer is the hidden layer, containing several units, and is also the network's output layer. A typical RBM structure is shown in Figure 5. Here, v = ( v₁ , v₂ , ..., vₘ ) represents the visible layer unit states, h = ( h₁ , h₂ , ..., hₙ ) represents the hidden layer unit states, m and n are the numbers of units in the visible and hidden layers respectively, θ = ( W , b , c ) represents the network parameters, W (an n×m matrix) represents the connection matrix between the hidden and visible layers, c = ( c₁ , c₂ , ..., cₙ ) represents the bias of the hidden layer units, and b = ( b₁ , b₂ , ..., bₘ ) represents the bias of the visible layer units. Unlike general networks, there are no direct connections between units within the same layer.

Figure 5. Network structure of RBM

In fact, RBM is a network model based on energy theory (for the evolution from general energy models to RBM, please refer to the detailed explanation in Section 5 "Energy-Based Models and Boltzmann Machines" in [6]). For energy models, the state distribution of the unit only needs to satisfy one of the exponential family distributions [92]. Different distributions can be selected for different practical problems, but some distributions will make the model difficult to train [32]. When all the unit state distributions satisfy the Bernoulli distribution, that is, each unit has only two state values, and the state space is selected as {0,1}, it can bring a lot of computational convenience, and the values ​​of 0 and 1 also have good physical meaning. Therefore, most RBMs choose {0,1} as the unit state value.

When each unit takes values in {0,1}, the energy of the network can be described as follows:

E( v , h | θ ) = − Σᵢ bᵢ vᵢ − Σⱼ cⱼ hⱼ − Σⱼ Σᵢ hⱼ Wⱼᵢ vᵢ

where i runs over the m visible units and j over the n hidden units.

The joint probability distribution of the visible layer and the hidden layer is:

p( v , h | θ ) = e^( −E( v , h | θ ) ) / Z

Z is the normalization term, summing e^( −E( v , h | θ ) ) over all configurations of v and h. The distribution of the visible layer p(v), that is, the marginal distribution of p(v,h), is:

p( v | θ ) = Σ_h p( v , h | θ )

The goal of training the RBM network model is to maximize the log-likelihood of p(v). Stochastic gradient methods are used for training, and the gradient is approximated using the contrastive divergence (CD) method [13]. For more detailed training methods, please refer to [20], which provides a clear introduction and includes pseudocode for training the RBM model.
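A single CD-1 update for a Bernoulli RBM can be sketched as below. The layer sizes, learning rate, and number of steps are arbitrary illustrative choices, and this is a toy sketch of the update rule rather than a full training loop from [20].

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 4                      # visible / hidden unit counts (toy sizes)
W = rng.normal(0, 0.01, (n, m))  # hidden-visible connection matrix
b = np.zeros(m)                  # visible bias
c = np.zeros(n)                  # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence (CD-1) step on a single {0,1} visible vector."""
    global W, b, c
    ph0 = sigmoid(W @ v0 + c)                 # P(h=1 | v0), positive phase
    h0 = (rng.random(n) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(W.T @ h0 + b)               # reconstruction P(v=1 | h0)
    v1 = (rng.random(m) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + c)                 # negative phase
    # positive minus negative phase approximates the log-likelihood gradient
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
for _ in range(100):
    cd1_update(v)
```

After enough updates, the model assigns higher probability to the training pattern; real training would of course iterate over a dataset in mini-batches.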

4.1.2 Deep Boltzmann Machine (DBM) and Deep Belief Network (DBN)

Both DBM and DBN are deep feedforward networks based on RBM, with typical structures shown in Figure 6 (DBN left, DBM middle). They share many similarities but also have fundamental differences, explained below from three aspects: network structure, applicable problems, and training methods.

From a network structure perspective, both DBM and DBN can be viewed as stacked RBM networks, and both are probabilistic graphical models. However, DBM is a completely undirected graph, corresponding to Markov random fields in graphical models; in the DBN network structure, only the top two layers are undirected, forming a true RBM, while the remaining layers are directed. Therefore, DBN is a directed graph, corresponding to sigmoid belief networks with many hidden layers and dense connections in graphical models.

The different network structures of DBM and DBN make them suitable for different problems. Because of DBM's structure, input and output can be inferred from each other, making DBM networks more suitable for building autoencoders for feature encoding and extraction [35][76]. In contrast, the state of lower-level units in DBN is affected only by the layer above, so inference runs from output to input: the distribution of the data (the initial input) can be inferred from the category (the output). DBN is therefore generally more suitable for classification [79].

Figure 6. Typical structure of RBM-based FNNs. Left: network structure of DBN, a hybrid directed and undirected graph model; Middle: network structure of DBM, an undirected probabilistic graphical model; Right: the expanded topology used when training the hidden layers of DBM.

In terms of training methods, both use similar training frameworks: first, a large amount of unlabeled data is used to perform greedy pre-training of the network layer by layer, and then some labeled data is used to adjust the network parameters [8][10]. However, due to the different network structures, the two will differ in pre-training and parameter adjustment: for DBN, due to the directionality of the network, during training, it is only necessary to treat the two adjacent layers as an RBM and train them layer by layer [8]; but for DBM, except for the first and last layers, each intermediate layer is affected by two aspects, and during training, the two adjacent layers are generally expanded into a three-layer neural network [85][36] (as shown in Figure 6-right).

4.1.3 Activation Function

We know that in the RBM model, the state of a neuron takes values in {0,1}. The state of a hidden layer unit is calculated from the state values of the input layer and the weight matrix, but the result of this calculation is not simply 0 or 1. The result is therefore mapped before being output to the hidden layer unit. Since the chosen mapping function determines whether the hidden unit is activated (a value of 1 indicates activation), it is generally called the activation function.

Activation functions generally have two properties: they map the input into the value space of a neuron, and they are continuously differentiable, because gradients must be computed during network training. Two commonly used activation functions are the sigmoid function (also known as the logistic function) and the hyperbolic tangent function tanh, as shown in Figure 7-left.

The sigmoid function fits the natural interpretation of neurons (it maps values to between 0 and 1, matching the neuron's state: 1 when activated and 0 otherwise) and suits shallower feedforward networks, such as multilayer perceptrons (MLPs) with a single hidden layer, while the hyperbolic tangent function achieves better results when training deep feedforward networks [28]. Glorot X et al. [30] demonstrated experimentally the impact of different activation function choices on the training of feedforward networks.

Figure 7. Activation function curves. Left: commonly used sigmoid and hyperbolic tangent functions; Right: comparison of the rectifier function, suited to sparse data, and its smoothed form.

Bengio et al. [28] proposed a rectifier neuron suited to sparse data and a deep sparse rectifier neural network, whose activation function is

rectifier( x ) = max( 0 , x )

To facilitate gradient calculation, a smooth form, softplus( x ) = log( 1 + eˣ ), is used [16].

The function curves for both are shown in Figure 7 (right). This activation function is more suitable for problems where the data itself is highly sparse, such as text classification problems that use BOW or VSM for feature representation, including sentiment analysis.
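The activation functions discussed above can be sketched directly; the sample points are arbitrary.

```python
import numpy as np

# Sketch of the activation functions discussed in this section.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic function, maps into (0, 1)

def rectifier(x):
    return np.maximum(0.0, x)         # max(0, x), suited to sparse data

def softplus(x):
    return np.log1p(np.exp(x))        # smooth form of the rectifier: log(1 + e^x)

x = np.linspace(-3, 3, 7)
y_sig, y_tanh = sigmoid(x), np.tanh(x)   # tanh maps into (-1, 1)
y_rect, y_soft = rectifier(x), softplus(x)
```

Note how the rectifier and softplus agree for large positive inputs, while softplus remains differentiable at 0.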

4.1.4 Applications of FNNs in Sentiment Analysis

When DBM and DBN were first proposed, they were mostly applied to image-related fields (such as MNIST digit handwriting recognition [85]). Currently, many researchers are also drawing on the application methods of DBM and DBN in image processing and applying them to text sentiment analysis.

A common approach is to represent the text as a 0/1 feature vector using the BOW model, use it as the original input, construct a DBN model, and train the DBN with layer-by-layer greedy training to obtain a network that can classify directly [76][96]. The main work in this approach is threefold. First, constructing the document-set dictionary: the choice of words in the dictionary affects the BOW feature representation of the text and has a large impact on subsequent training and classification results. Second, constructing the network structure: for classification problems the numbers of input-layer and output-layer units are determined, but the number of hidden layers and units per layer must be chosen carefully; there is currently no good guidance on determining the network structure, and experience must be accumulated through experiments, though generally the size of the network is positively correlated with the amount of data. Finally, labeling the training data: although a large amount of unlabeled data can be used for pre-training the DBN, the parameter-adjustment stage still needs a relatively large amount of labeled data compared with traditional machine learning methods.
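The 0/1 BOW representation used as DBN input can be sketched as follows; the dictionary here is a toy illustration, not one from the cited work.

```python
# Minimal sketch of the binary BOW representation: a document becomes a
# 0/1 vector over a fixed dictionary. The dictionary is a toy example.
dictionary = ["good", "bad", "movie", "plot", "boring"]
word_index = {w: i for i, w in enumerate(dictionary)}

def bow_binary(text):
    """Map a document to a 0/1 presence vector over the dictionary."""
    vec = [0] * len(dictionary)
    for token in text.lower().split():
        i = word_index.get(token)
        if i is not None:
            vec[i] = 1   # presence only, not a count
    return vec

v = bow_binary("Good movie good plot")
```

As the text notes, the choice of dictionary words directly determines which documents become distinguishable vectors.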

Based on this general framework, many researchers have made improvements. Sun X et al. [86] introduced the social relationships formed by mutual comments on Weibo into the features and, together with features obtained from a sentiment dictionary, constructed a feature vector for Weibo sentiment analysis; they then used a general DBN network model to perform sentiment analysis on Chinese Weibo. Zhou S et al. [101] combined the DBN network with Active Learning [11]: the Active Learning method selects the data to be labeled, after which the DBN parameters are adjusted, yielding better positive/negative sentiment classification. In [102], Zhou S et al. introduced a fuzzy factor: they first trained the DBN to obtain model parameters, computed the fuzzy factor from the trained model, added it to the DBN network, and retrained to obtain the final classification model.

In addition to using DBN to train a classification model directly, DBM can be used to encode and extract features from text, and the extracted features can then train classifiers such as SVM, with deep networks serving only as tools for automatic feature extraction. In [29], Glorot X et al. proposed a domain-adaptive deep learning method along these lines for sentiment analysis of large amounts of multi-domain product review data. They used autoencoders with rectifier units [28] to encode features of product review texts, mixing review texts from different product domains together as training data. The review-text features extracted by this network are domain-adaptive: the features and sentiment labels of reviews in one domain suffice to train a sentiment classifier (such as SVM), which can then perform sentiment analysis on product reviews from other domains.

4.2 Recurrent Neural Network (RNN) Structure and Its Application in Text Sentiment Analysis

4.2.1 Standard RecursiveNN

The standard RecursiveNN model was proposed by Goller C [25] and Socher R [81]. Its network structure is a binary tree: the leaf nodes are the basic units of the problem being processed, such as words in sentences or segmentation regions in images, and the non-leaf nodes are the hidden nodes of the network. Processing generally proceeds bottom-up. There are three main ways to use the RecursiveNN model. The first is to leave the network structure unfixed and learn the hierarchical structure automatically, so as to automatically construct the parse tree of a sentence or the relationships between segmentation regions of an image [81]. The second is to obtain a fixed network structure for a sentence through syntactic parsing, represent the words at the leaf nodes as vectors treated as network parameters, and obtain vector representations of words and sentences by training the network [77][64]. The third is to fix the network structure and learn the sentiment polarity of phrases and sentences, using it for text sentiment analysis. The following mainly analyzes the application of the RecursiveNN model to text sentiment analysis.

To illustrate this more clearly, we use a three-word phrase as an example of how recursive neural networks are used for text sentiment analysis, as shown in Figure 8 (top left). For a sentence or phrase, syntactic parsing is first performed to obtain the parse tree shown. Words at the leaf nodes are then represented as d-dimensional word vectors. Non-leaf nodes are treated as units similar to words, and their vector representations are calculated with the function g from their two child nodes. Each node outputs its sentiment label. For node a, its label can be computed as

yᵃ = softmax( Wₛ a )

where Wₛ is the C×d weight matrix for sentiment classification, C is the number of sentiment categories, and d is the dimension of the word vectors. Many proposed RecursiveNN models use the same strategies for word representation and node label output; the difference lies in how they calculate the vector representation of the hidden parent nodes.

4.2.2 Calculation of Hidden Nodes

In standard RecursiveNN, the vector representations of the two child nodes are concatenated and fully connected to the parent node, typically with a hyperbolic tangent output. For the triplet shown in the upper left of Figure 8, p₁ and p₂ can be calculated with the following combination equations:

p₁ = f( W [ b ; c ] ) ,  p₂ = f( W [ a ; p₁ ] )

where [ x ; y ] denotes the concatenation of two column vectors.

Here, W is the weight matrix, a key parameter of standard RecursiveNN. It is globally shared, meaning that W is the same for every hidden node. f(·) is the chosen nonlinear output function, applied element-wise to the input vector.
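The standard composition for the phrase (a, (b, c)) can be sketched as below; all parameters are randomly initialized purely for illustration, and the sizes are toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, C = 4, 2   # word-vector dimension and number of sentiment classes (toy sizes)

W  = rng.normal(0, 0.1, (d, 2 * d))   # globally shared combination matrix
Ws = rng.normal(0, 0.1, (C, d))       # sentiment classification matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combine(left, right):
    """Parent vector p = f(W [left; right]) with f = tanh, element-wise."""
    return np.tanh(W @ np.concatenate([left, right]))

a, b, c = (rng.normal(0, 0.1, d) for _ in range(3))
p1 = combine(b, c)              # hidden node for (b, c)
p2 = combine(a, p1)             # root node for (a, (b, c))
label_dist = softmax(Ws @ p2)   # sentiment distribution at the root
```

In training, every node's `label_dist` would be compared against its sentiment label, and gradients propagated back through the shared W.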

Socher et al. [80] argued that most parameters are associated with words and that the vector representation of a long phrase computed by nonlinear functions depends on the words involved. In standard RecursiveNN, many words in a long phrase interact only through implicit relationships: a, b, and c form the whole phrase, but the interaction between a and b, c is very weak. Based on this idea, Socher et al. [80] represented each word and phrase as a vector-matrix pair, as shown in Figure 8-lower left, and proposed the Matrix-Vector Recursive Neural Network (MV-RNN). Under this representation, with ( b , B ) and ( c , C ) the vector-matrix pairs of the two child nodes, the hidden node ( p₁ , P₁ ) can be calculated with the following combination equations:

p₁ = f( W [ C b ; B c ] ) ,  P₁ = W_M [ B ; C ]

Each word matrix is d×d in size, and W_M, like W in standard RecursiveNN, is used to compute the matrix representation of the hidden nodes and is also globally shared. ( p₂ , P₂ ) can be computed similarly.

MV-RNN strengthens the interaction between all words composing a phrase, but representing every word and hidden node with a matrix makes the network parameters very numerous and dependent on phrase length. In [83], the authors computed all hidden nodes with a combination equation based on a global tensor and called this network a Recursive Neural Tensor Network (RNTN). Each word and hidden node is still represented by a d-dimensional vector. Taking p₁ as an example, it can be computed as:

p₁ = f( [ b ; c ]ᵀ V [ b ; c ] + W [ b ; c ] )

where V is a tensor with d slices, each a 2d×2d matrix, so that [ b ; c ]ᵀ V [ b ; c ] yields a d-dimensional vector.

Here, f(·) and W are the same as in standard RecursiveNN, and the tensor V is global, with each of its component matrices shared across all nodes.

Figure 8-right shows the case when the word vector dimension is 2.

4.2.3 Using RecursiveNNs for Text Sentiment Analysis

When using RecursiveNNs for text sentiment analysis, three aspects need to be considered: initialization of the word vectors; determination of the network structure, i.e., the syntactic tree structure; and the training method and labeling of the training data.

Word vectors can be randomly initialized using zero-mean, small-variance Gaussian noise, but generally, using pre-trained results from other continuous word vector learning methods (such as word2vec in Section 2) can achieve better results [49][90][84].

For the network structure, a syntactic parse tree is usually constructed first, and the network is built on this fixed tree structure [83][77][49]. Socher R et al. [84] introduced a reconstruction error and built the network layers automatically in a greedy, step-by-step manner. Reconstruction means recovering the word vectors of the two child nodes from the parent node's vector and a reconstruction connection matrix, with the reconstruction error computed as the difference between the reconstructed child vectors and the original child vectors. The greedy procedure selects, at each step, two adjacent nodes as children, computes the parent vector, reconstructs the two child vectors, and computes the reconstruction error; the pair with the smallest current reconstruction error is merged, repeating until the root node is reached.

The training of RecursiveNNs is generally carried out using a supervised method, with the optimization objective being to minimize the sum of the errors between the sentiment distribution of each node in the network and the real labels [83]. Stochastic gradient descent and backpropagation algorithms are used for training. Therefore, the sentiment polarity of each word and phrase in the training sentences needs to be labeled, which is a relatively large amount of manual labeling work. Existing sentiment dictionaries can be used to help with training [83][77]. Socher R et al. [84] used a semi-supervised method. In the optimization objective of the model, in addition to minimizing the sum of the errors of sentiment labels, they also considered minimizing the sum of the reconstruction errors of all nodes. The calculation of the reconstruction error does not require labeled data.

4.3 Convolutional Neural Networks (CNNs) Structure and Their Application in Text Sentiment Analysis

CNNs were initially proposed by LeCun Y et al. and applied to handwritten character recognition in the image field [50][53][51]; LeNet-5, built with deep CNNs, achieved good results. In recent years, deep CNNs have achieved remarkable results in image object classification [42], image semantic segmentation [27], and speech recognition [23]. Compared with other deep network models, CNNs have the following main characteristics: they perform convolution on the input through convolution kernels (also called filters in much of the literature) to obtain local structural features [51] and extract more semantic feature representations layer by layer; through weight sharing, they represent more neuronal connections with fewer parameters [3]; they use multiple channels (also called feature maps) to extract multi-dimensional features from the input; and convolutional layers are often followed by downsampling (pooling) layers that reduce the complexity of the network.

Based on the experience of using CNNs in the image and speech fields, many researchers have proposed CNNs for text modeling and analysis [47][51][45][40]. These networks borrow many practices from CNNs in image processing, but with many changes for text. The three most commonly used CNN constructions for text modeling and sentiment analysis are shown in Figure 9.

Text sentiment analysis methods based on deep CNNs generally have the processing flow shown in Figure 10. The following describes the specific operations of three commonly used CNNs in text sentiment analysis at each step, and finally provides an overview of the methods for applying these models to text sentiment analysis.

4.3.1 Text Representation in CNN Structure

In CNN methods for text processing, words are first represented as continuous d-dimensional vectors; let wⱼ denote the column-vector representation of the word with index j in the dictionary. The vector representations of all words in the dictionary are collected into a matrix W_dic of size d×|V|, where |V| is the size of the dictionary. When a word vector is needed, the corresponding column vector is obtained by table lookup.

W_dic, as a parameter of the CNN network, can be handled in several ways: random initialization, continuously adjusted as the network is trained; initialization with word vectors learned by methods such as word2vec, kept fixed; or initialization with such word vectors, fine-tuned as the network is trained. Kim Y [47] tried all of these and showed experimentally that the last method achieves the best results, while random initialization performs worst. A sentence of length ℓ can then be represented as a matrix by arranging the column vectors of its words in order.
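The lookup-table representation can be sketched as follows; the dictionary, dimension, and random initialization are toy illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the lookup-table text representation: W_dic holds one column per
# dictionary word, and a sentence becomes a d x l matrix of stacked columns.
d = 5
dictionary = ["the", "film", "was", "great"]
W_dic = rng.normal(0, 0.1, (d, len(dictionary)))   # d x |V| embedding matrix
index = {w: j for j, w in enumerate(dictionary)}

def sentence_matrix(tokens):
    """Look up each word's column vector and stack them in sentence order."""
    cols = [W_dic[:, index[t]] for t in tokens]
    return np.stack(cols, axis=1)

S = sentence_matrix(["the", "film", "was", "great"])
```

Whether W_dic is frozen or fine-tuned during training is exactly the design choice Kim Y [47] compared.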

Figure 9. Common CNNs used for text modeling and sentiment analysis. Top left: the processing method used at each step of the network shown is common practice, denoted Common Convolutional Neural Network (C-CNN) [47]; Bottom left: the convolution operation is applied to the energy-based RBM, mainly for text feature extraction and dimensionality reduction, denoted Convolutional Restricted Boltzmann Machine (CRBM) [40]; Right: max-k pooling is used in each pooling layer, with each layer's k value computed dynamically, denoted Dynamic Convolutional Neural Network (DCNN) [45].

Figure 10. General processing steps of CNNs for text sentiment analysis

4.3.2 Key Steps of CNN:

A. Convolution operation

In CNN networks for image processing, convolution operations are matrix convolutions without zero padding: the kernel matrix slides across the input matrix to produce the convolution result matrix. In CNN networks for text sentiment analysis, there are two types of convolution: a matrix sliding-window convolution similar to that used for images, and a vector convolution. Let the kernel be M and the input be the matrix representation S of the sentence. C-CNN (Figure 9, top left) and CRBM (Figure 9, bottom left) use matrix sliding-window convolution, in which the width of M equals the word-vector dimension d and the window spans n words; the convolution yields a vector c.

There are generally two ways to slide the convolution window from the beginning to the end of a sentence. In C-CNN, the kernel is aligned with the beginning of the sentence and slides one word at a time until the end. In this case, each element cⱼ of c can be calculated as:

cⱼ = M • S[ : , j : j+n−1 ] + b

where • denotes the sum of the element-wise product of the kernel and the window, and b is a bias term.

In many articles, this method is represented as concatenating the word vectors in the window into a long vector and then performing convolution, such as [18][2][14]. Its mathematical essence is consistent with that described above.
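The C-CNN sliding-window convolution can be sketched directly; the sentence length, kernel width, and random values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sketch of C-CNN matrix sliding-window convolution: the kernel spans the
# full word-vector dimension d and a window of n words. Toy sizes.
d, n, l = 4, 2, 6
S = rng.normal(0, 1, (d, l))   # sentence matrix, one column per word
M = rng.normal(0, 1, (d, n))   # convolution kernel

def window_convolve(S, M, bias=0.0):
    """Slide the kernel one word at a time; each window yields one scalar."""
    d, l = S.shape
    n = M.shape[1]
    return np.array([np.sum(M * S[:, j:j + n]) + bias
                     for j in range(l - n + 1)])

c = window_convolve(S, M)
```

This is mathematically the same as concatenating each window's word vectors into one long vector and taking a dot product with the flattened kernel, as the text notes.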

In CRBM, an n-gram approach is used: centered on the current word, windows of size (n−1)/2 are taken before and after it, sliding word by word with zero padding at the boundaries. In this case, each element cⱼ of c can be calculated as:

cⱼ = M • S[ : , j−(n−1)/2 : j+(n−1)/2 ]

where • again denotes the sum of the element-wise product and out-of-range columns of S are taken as zero.

DCNN employs vector-based convolution operations. The kernel M also has d rows; each row of M is a kernel vector m, which is convolved as a one-dimensional vector convolution with the corresponding row s of S, and the results of all rows form the convolution result matrix C.

In vector convolution, each element cⱼ of the result c obtained by convolving a kernel vector m (of length n) with a vector s (of length ℓ) can be calculated as:

cⱼ = Σᵢ mᵢ s₍ⱼ₋ᵢ₊₁₎

Two variants follow from this equation. The narrow variant requires ℓ ≥ n and produces a result of length ℓ − n + 1, while the wide variant places no size requirement on either vector and produces a result of length ℓ + n − 1; in the wide case, the index of s goes out of bounds when computing elements of c with j ≤ n − 1 or j > ℓ, and zero padding is used there. DCNN uses the wide approach.
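The narrow and wide variants correspond to the `valid` and `full` modes of NumPy's one-dimensional convolution, which can serve as a quick sketch (toy row and kernel values):

```python
import numpy as np

# Narrow vs wide 1-D convolution on a single row of the sentence matrix.
# np.convolve flips the kernel, matching c_j = sum_i m_i * s_{j-i+1}.
s = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one row of S, length l = 5
m = np.array([1.0, 0.0, -1.0])            # one row of the kernel, length n = 3

narrow = np.convolve(s, m, mode="valid")  # length l - n + 1 = 3
wide   = np.convolve(s, m, mode="full")   # length l + n - 1 = 7, zero-padded ends
```

The wide result keeps boundary words in play at the cost of extra output positions, which DCNN then shrinks again through pooling.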

B. Multichannel Convolution

In CNN networks, multiple convolution kernels are often used to extract features of more dimensions from the input, borrowing the idea of color channels in images; this is called multi-channel convolution. In CNNs for text sentiment analysis, multi-channel convolution is handled differently depending on the convolution method and network structure. With matrix sliding-window convolution, each channel yields a separate result vector, independent of the others. With vector convolution, however, each channel produces a result matrix, which can serve as input to subsequent convolutional layers (usually after pooling and a linear mapping, ignored here). In subsequent convolutional layers, the input of each channel combines the convolution results of all channels of the previous layer, as shown in Figure 9-right. Let C_ij be the convolution result of the i-th channel of the j-th convolutional layer, M_ij the i-th convolution kernel of the j-th convolutional layer, and m_j the number of channels of the j-th layer. Then the output of the (j+1)-th convolutional layer is:

C_i,j+1 = Σₖ₌₁ ... m_j : M_i,j+1 * C_k,j (summed over the m_j channels k of the j-th layer)

* indicates a convolution operation.

C. Pooling

In CNN networks, a convolutional layer is typically followed by a pooling layer for downsampling. The main purpose of the pooling layer is to reduce the dimensionality of the local features obtained after convolution and to integrate them into higher-level features. Commonly used pooling methods include Max Pooling, k-Max Pooling, Dynamic k-Max Pooling, and Folding.

Max Pooling is suitable for networks using sliding-window convolution: each channel yields a column vector after convolution, and the maximum value of each vector is taken to form a new feature vector, which is passed to the fully connected layer. CNNs using Max Pooling generally contain only one convolutional layer, because after pooling the input has been reduced to a one-dimensional vector, which is unsuitable for further convolution. Analogously, the minimum or average of the vector can be taken instead.

k-Max Pooling is suitable for networks using vector convolution: each convolution produces a matrix, and the k largest values of each column are kept to form a new matrix. k-Max Pooling essentially maps the original sentence length to a fixed k, which is convenient for handling texts of varying lengths. Because the result after pooling is still a matrix, further convolutional layers can be added for higher-level feature extraction. Generally, we want the feature dimension to decrease progressively, so the k value of the pooling layer after each convolutional layer must decrease progressively. This pooling method with dynamically changing k is called Dynamic k-Max Pooling. The k value of the j-th pooling layer can be calculated as:

kⱼ = max( K_top , ⌈ ( J − j ) / J · ℓ ⌉ )

where K_top is the k value of the last pooling layer, J is the number of pooling layers, ℓ is the length of the text sentence, and ⌈·⌉ is the ceiling function.
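k-Max Pooling and the dynamic k schedule can be sketched as below; the matrix values and layer counts are toy illustrations.

```python
import math
import numpy as np

def k_max_pool(C, k):
    """Keep the k largest entries of each column of an (l x d) matrix,
    preserving their original (sentence) order; the result is k x d."""
    cols = []
    for col in C.T:                          # iterate over the d feature columns
        idx = np.sort(np.argsort(col)[-k:])  # top-k positions, in original order
        cols.append(col[idx])
    return np.stack(cols, axis=1)

def dynamic_k(j, J, k_top, l):
    """k for the j-th pooling layer: max(k_top, ceil((J - j) / J * l))."""
    return max(k_top, math.ceil((J - j) / J * l))

C = np.array([[1.0, 9.0],
              [5.0, 2.0],
              [3.0, 8.0],
              [7.0, 4.0]])   # l = 4 positions, d = 2 feature columns
P = k_max_pool(C, 2)
```

Note that k-max pooling keeps the selected values in their original order, so relative position information survives the downsampling.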

Folding is similar to downsampling after an image convolutional layer: adjacent column vectors are merged by averaging, maximizing, or minimizing, sampling a matrix of dimension ℓ×d down to ℓ×(d/2).

D. Nonlinear mapping

In multi-layer CNN networks, the data after convolution and pooling is sometimes not fed directly into the next convolutional layer but is first mapped through a nonlinear function. Commonly used functions include tanh, sigmoid, and the rectifier. The rectifier is a good choice in text processing, as shown by the authors of [47]. Indeed, some advantages of the rectifier can be seen from the shapes and derivatives of the three functions: its derivative is 1 for inputs greater than 0, which propagates the error of the next layer back to the previous layer effectively and thus accelerates training convergence.

F. Fully Connected Layer

Networks like C-CNN and DCNN, which are applied directly to sentiment classification, add several fully connected layers after the final pooling layer. Networks with only one fully connected layer generally use Softmax classification to determine the sentiment category of the text, while networks with multiple fully connected layers treat the later fully connected part as a multilayer perceptron and use dropout [34] to regularize network training.
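A hedged sketch of such a classification head — Softmax over the final fully connected outputs, plus inverted dropout for training (shapes and values are illustrative, not from any cited network):

```python
import numpy as np

def softmax(z):
    """Softmax over the outputs of the last fully connected layer."""
    z = z - z.max()             # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dropout(h, p, rng):
    """Inverted dropout: zero units with probability p during training and
    rescale survivors, so nothing needs to change at test time."""
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

logits = np.array([1.0, 2.0, 0.5])  # hypothetical final-layer output
probs = softmax(logits)             # one probability per sentiment class
```

At prediction time, `argmax(probs)` gives the sentiment category; dropout is applied only to the hidden fully connected layers during training.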

4.3.3 Application of CNN in Text Sentiment Analysis

There are two main approaches to using CNNs for text sentiment analysis: one adds several fully connected layers at the end of the CNN network and performs sentiment classification directly [47][45][18][73][14]; the other uses CNNs to extract text features and then applies a classifier such as an SVM for sentiment classification [40][2][69].

Kim [47] used a network similar to C-CNN to handle many sentence classification problems, including sentiment classification. The difference is that two channels were used at the initial input: one represents words and text with word vectors trained by word2vec and keeps these vectors fixed, called the static feature channel; the other is similar but fine-tunes the vectors during training of the whole CNN network, called the dynamic feature channel. dos Santos et al. [18] used a deep convolutional neural network to classify the sentiment of movie reviews and Twitter texts. Unlike the usual word representation, the authors added character-level word vectors when learning the vector representation of words. Learning character-level word vectors resembles the way sentence vectors are obtained from word vectors: each character in the character dictionary is represented as a vector, and a C-CNN-like network produces the character-level vector representation of each word. This character-level representation is then combined with the word-level representation, another C-CNN-like network learns the sentence vector, and the result is fed to fully connected layers for sentiment classification.

Huynh T. et al. [40] proposed the CRBM network for learning text features layer by layer and then classifying the subjectivity and sentiment of product reviews. As introduced above, at each layer k convolutional kernels of the same size but different weights convolve the input sentence matrix, changing the word-vector dimension from d to k. After each convolution operation, no pooling or other processing is performed; the data is passed directly to the next convolutional layer. After the last convolution, all word vectors in the sentence are averaged to obtain the sentence feature vector, which is then used to train an SVM and perform classification.

4.4 Recurrent Neural Networks (RNNs) Structure and Their Application in Text Sentiment Analysis

Recurrent Neural Networks (RecurrentNNs) have a long history of development (see [97]) and were mainly used in speech processing and related problems. In recent years, RecurrentNNs have been widely applied to text processing, as the articles surveyed here show. This section introduces basic RecurrentNNs and their most commonly used variants, and summarizes methods for using RecurrentNNs for text modeling and sentiment analysis.

4.4.1 Basic RecurrentNNs

Unlike other neural networks, the state of each unit in a RecurrentNN is temporally ordered (corresponding to the order in which words appear in a text). The output of the hidden layer can be cached and used as part of the input at the next time step; for this reason, the hidden units in RecurrentNNs are generally called memory units. The cached hidden state from the previous time step, together with the current input layer, forms the generalized input to the hidden layer, and this generalized input is fully connected to the hidden layer. As shown in Figure 11-left, we discuss the network structure and training method using a basic RecurrentNN with one hidden layer as an example.

Let x<sub>t</sub> be the input; for text sentiment analysis, this is generally the concatenation of the word vectors of the current n-gram phrase or sentence, with one n-gram phrase or sentence processed per time step. Let h<sub>t</sub> be the hidden layer and y<sub>t</sub> the output. The unit values in the network can be viewed as states with temporal order, where t is the current time step. Let U, W, and H be the fully connected weight matrices between the cached previous hidden state and the hidden layer, between the input layer and the hidden layer, and between the hidden layer and the output layer, respectively. The state update equations for each layer are as follows:

h<sub>t</sub> = f( W x<sub>t</sub> + U h<sub>t-1</sub> )
y<sub>t</sub> = g( H h<sub>t</sub> )

where f(·) and g(·) are non-linear activation functions. Generally, f(·) is the Sigmoid function, and g(·) is either the Sigmoid function or the tanh function.
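A minimal numpy sketch of one forward time step under these equations (dimensions are toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W, U, H):
    """One time step of the basic RecurrentNN described above:
    h_t = f(W x_t + U h_{t-1}), y_t = g(H h_t), here with f = g = sigmoid."""
    h_t = sigmoid(W @ x_t + U @ h_prev)
    y_t = sigmoid(H @ h_t)
    return h_t, y_t

# Toy dimensions: 3-dim inputs, 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
W, U, H = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
h = np.zeros(4)                        # initial cached hidden state
for x_t in rng.normal(size=(5, 3)):    # a 5-step input sequence
    h, y = rnn_step(x_t, h, W, U, H)   # h is carried to the next step
```

The same `h` carried across iterations is exactly the cached hidden state that makes the units "memory units".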

When computing the current state, RecurrentNNs can take into account the influence of many preceding time steps; in terms of text, this means they can model the influence of many preceding phrases (sentences). Because the node states of the network have temporal dependencies, the training process differs from that of other networks; the commonly used method is the back-propagation through time algorithm (BPTT) [5][70]. As shown in Figure 11, we unfold the node relationships of three adjacent time steps and obtain the gradient propagation path used during backpropagation. In fact, the forward state computation unfolds the required time steps in a similar way.
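The unfolding idea behind BPTT can be sketched for a toy tanh RecurrentNN with a linear readout on the final hidden state and squared loss (an illustrative setup, not the paper's): the forward pass caches every hidden state, and the backward pass walks the unfolded time steps in reverse, accumulating gradients for the shared weights W and U.

```python
import numpy as np

def bptt(xs, target, W, U, v):
    """BPTT for h_t = tanh(W x_t + U h_{t-1}), loss 0.5*(v.h_T - target)^2.

    Returns the loss and the gradients of W and U, accumulated over all
    unfolded time steps (the weights are shared across time).
    """
    T, h = len(xs), np.zeros(W.shape[0])
    hs, prevs = [], []
    for x_t in xs:                       # forward pass, caching states
        prevs.append(h)                  # h_{t-1}
        h = np.tanh(W @ x_t + U @ h)
        hs.append(h)                     # h_t
    err = v @ h - target
    loss = 0.5 * err ** 2
    dW, dU, dh = np.zeros_like(W), np.zeros_like(U), err * v
    for t in range(T - 1, -1, -1):       # backward pass through time
        da = dh * (1.0 - hs[t] ** 2)     # through tanh
        dW += np.outer(da, xs[t])        # shared-weight gradients add up
        dU += np.outer(da, prevs[t])
        dh = U.T @ da                    # propagate to the earlier step
    return loss, dW, dU

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(3, 2))
U = 0.1 * rng.normal(size=(3, 3))
v = rng.normal(size=3)
xs = rng.normal(size=(4, 2))             # a 4-step input sequence
loss, dW, dU = bptt(xs, 1.0, W, U, v)
```

The repeated multiplication by U.T in the backward loop is also where the vanishing/exploding gradient problem discussed in the next subsection originates.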

4.4.2 Long-Short-Term-Memory

Basic RecurrentNNs have a significant drawback: it is difficult for them to account for inputs from distant time steps (for text sentence modeling, this means long sentences cannot be handled). When the time sequence is too long, exploding and vanishing gradient problems arise when training the network [31][66], making it impossible to learn the network parameters.

To address the shortcomings of basic RecurrentNNs, many gated memory units have been proposed for building networks.

Among them, Long-Short-Term-Memory (LSTM) [26][39] can automatically decide the actions of the hidden unit according to the input and the state of the memory cell: forgetting the memorized state, accepting input, and outputting the state value to subsequent network layers. It is currently the most widely used hidden memory unit in RecurrentNNs.

A commonly used LSTM unit is shown in Figure 11-right. Viewed from the outside, a network built with LSTM units behaves like the hidden units of basic RecurrentNNs: it receives the input x<sub>t</sub> from the input layer and, after a non-linear mapping, passes the layer's output h<sub>t</sub> to the next layer. Internally, besides caching the hidden state of the previous time step, the LSTM adds "gates" that control accepting input, producing output, and selective forgetting; g<sub>it</sub>, g<sub>ot</sub>, and g<sub>ft</sub> denote the open/closed states of these three gates at the current time step t. The inputs controlling the three gates are the same: the current input x<sub>t</sub> and the cached hidden state s<sub>t-1</sub>. Hence, different LSTM units in the same hidden layer share the states of the three gates; what differ are the weight matrices: W<sub>i</sub>, W<sub>o</sub>, W<sub>f</sub> denote the connection weights between the input and the three gates, and U<sub>i</sub>, U<sub>o</sub>, U<sub>f</sub> denote the connection weights between the cached hidden state and the three gates. At time step t, the open/closed states of the three gates can be computed as follows:

g<sub>it</sub> = σ( W<sub>i</sub> x<sub>t</sub> + U<sub>i</sub> s<sub>t-1</sub> )
g<sub>ot</sub> = σ( W<sub>o</sub> x<sub>t</sub> + U<sub>o</sub> s<sub>t-1</sub> )
g<sub>ft</sub> = σ( W<sub>f</sub> x<sub>t</sub> + U<sub>f</sub> s<sub>t-1</sub> )

where σ(·) denotes the Sigmoid function. The internal state s<sub>t</sub> of the hidden unit and the output h<sub>t</sub> are updated by the following equations (in the standard LSTM formulation, where ⊙ denotes element-wise multiplication and W and U connect the input and the cached state to the candidate cell input):

s<sub>t</sub> = g<sub>ft</sub> ⊙ s<sub>t-1</sub> + g<sub>it</sub> ⊙ tanh( W x<sub>t</sub> + U s<sub>t-1</sub> )
h<sub>t</sub> = g<sub>ot</sub> ⊙ tanh( s<sub>t</sub> )

For RecurrentNNs built from LSTM units, training also follows the BPTT approach. Besides LSTM, there are many other gate-based RecurrentNN units [12][88]; their main ideas are essentially the same: the memory unit should handle temporal relationships automatically, deciding on its own the data input, state output, and forgetting of the memory unit.
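Putting the gate and state equations together, one LSTM step can be sketched in numpy. Note that the candidate-cell weights (here `Wc`, `Uc`) are an assumption for illustration: the text names only the three gate weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, Wi, Ui, Wo, Uo, Wf, Uf, Wc, Uc):
    """One step of the LSTM unit described above, in the paper's notation
    (gates driven by x_t and the cached state s_{t-1}).
    Wc/Uc feed the candidate cell input and are assumed, not named in the text."""
    g_i = sigmoid(Wi @ x_t + Ui @ s_prev)   # input gate
    g_o = sigmoid(Wo @ x_t + Uo @ s_prev)   # output gate
    g_f = sigmoid(Wf @ x_t + Uf @ s_prev)   # forget gate
    s_t = g_f * s_prev + g_i * np.tanh(Wc @ x_t + Uc @ s_prev)
    h_t = g_o * np.tanh(s_t)                # output of the unit
    return s_t, h_t

# Toy dimensions: 3-dim input, 4 hidden units.
rng = np.random.default_rng(0)
Wi, Wo, Wf, Wc = (rng.normal(size=(4, 3)) for _ in range(4))
Ui, Uo, Uf, Uc = (rng.normal(size=(4, 4)) for _ in range(4))
s, h = lstm_step(rng.normal(size=3), np.zeros(4),
                 Wi, Ui, Wo, Uo, Wf, Uf, Wc, Uc)
```

The additive term g<sub>ft</sub> ⊙ s<sub>t-1</sub> is what lets gradients flow over long spans when the forget gate stays open.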

4.4.3 Text Sentiment Analysis Based on RecurrentNNs

There are two main ways to apply RecurrentNNs to text sentiment analysis. The first models the word sequence of a text with a language model and then learns word vector representations [59][75]. Words are handled much as in word2vec: each word is represented as a word vector. At each time step, the word vectors of one n-gram phrase (formed by concatenating its words' vectors) are fed into the RecurrentNN, and the model's output estimates the probability distribution of the central word of the n-gram.

The second models sentences or documents to obtain vector-form features, which are then used for text sentiment analysis. Rong W. et al. [72] used a dual-form RecurrentNN containing two basic recurrent hidden layers to learn vector representations of movie-review sentences and then analyzed the sentiment distribution of these reviews. Kiros R. et al. [46] and Tang D. et al. [75] modeled sentences with deep networks built from gated units.

4.5 Parameter Learning in Deep Networks

In general, the parameters of deep networks are trained with the SGD-based BP algorithm: according to the specific problem and the network structure used, the objective function to be optimized is defined and the optimization problem to be solved is determined, turning the training of network parameters into the solution of an optimization problem.

Deep learning methods generally solve this optimization problem with SGD; whether better optimization methods exist is a question worth exploring. The key is how to propagate the parameter gradients produced at the output layer back to the hidden and input layers to update the network parameters. Once the network structure is fixed, backpropagation is usually straightforward: for FNNs and RecursiveNNs, gradient backpropagation is the inverse of the forward computation. For CNNs, however, the pooling layers require extra care when propagating gradients between the layers before and after pooling; for Max-Pooling, for example, one must decide whether to distribute the gradient over all pre-pooling units as a weighted average or to assign it entirely to the unit that held the maximum value.
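The Max-Pooling gradient-routing choice discussed above can be sketched as follows; the common choice (assigning the whole gradient to the max unit) is shown, since only that unit influenced the forward output.

```python
import numpy as np

def max_pool_forward(v):
    """Forward max pooling over a column vector; remember the argmax
    so the backward pass knows where to route the gradient."""
    idx = int(np.argmax(v))
    return v[idx], idx

def max_pool_backward(grad_out, idx, length):
    """Route the incoming gradient entirely to the unit that held the
    maximum; all other units receive zero gradient."""
    grad_in = np.zeros(length)
    grad_in[idx] = grad_out
    return grad_in

val, idx = max_pool_forward(np.array([0.2, 0.9, 0.1]))
grads = max_pool_backward(1.0, idx, 3)  # all gradient to position 1
```

The weighted-average alternative mentioned in the text would instead spread `grad_out` over all units in proportion to their pre-pooling values.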

This paper does not give a detailed account of the concrete training procedure for every deep network model. BP is more a training framework than a single specific algorithm: different networks and different problem-specific objective functions lead to different gradient computations within BP, so it is difficult to give a unified description of the BP procedure; for the concrete training method of each deep network, please refer to the relevant literature. Since CNNs are widely used in text modeling and text sentiment analysis, Appendix A gives the gradient derivation for a shallow CNN network and presents an MPI-based training method for multi-core clusters. (To be continued)
