The accuracy of automatic computer grading of subjective questions depends largely on the accuracy of its text similarity calculation. Because text similarity calculation is widely applied in document copy detection, information retrieval, and machine translation, more and more scholars have turned to the study of text similarity algorithms in recent years. Broadly speaking, text similarity calculation methods fall into two categories. Statistics-based methods require large-scale corpora and ignore sentence structure and semantic information, so their results sometimes fail to match human understanding of natural language. Semantic-understanding-based methods require no large corpus but rely on a hierarchical semantic dictionary; their results are comparatively accurate and align better with human understanding of natural language. The following sections introduce several classic text similarity calculation methods and briefly analyze their performance.
1. Calculation methods based on the vector space model
The Vector Space Model (VSM) is an information retrieval model that has proven effective and been widely adopted in recent years. In this model, a text is treated as a collection of independent words. If a document D contains the words t1, t2, ..., tN, it is represented as D(t1, t2, ..., tN). Since the words in a document vary in importance, and this importance significantly affects text similarity calculation, each word tk (1 ≤ k ≤ N) is assigned a weight wk, giving the representation D(t1, w1; t2, w2; ...; tN, wN), abbreviated D(w1, w2, ..., wN). A text is thus represented as a vector, and the similarity between two texts can be measured by the angle between their vectors: the larger the angle, the lower the similarity.
The vector space model assumes that the words in a text are independent of one another, so the text can be represented as a vector. This representation simplifies the complex relationships between words and makes text similarity computable. A word's weight in the vector should reflect its importance to the whole text, typically estimated from statistically obtained word frequencies, and the combination of all vector components should suffice to distinguish this text from other texts.
Numerous statistical results show that the most frequent words in a text are usually function words reflecting grammatical structure rather than content, while the least frequent words are too rare to characterize the text. The core words an author uses to discuss a particular issue tend to fall between these extremes, and texts centered on the same core issue should share similar core vocabulary. Therefore, neither the highest- nor the lowest-frequency words are suitable as feature words for a text; only words of intermediate frequency are.
Words that appear frequently in a text should receive higher weights, so the weight calculation should consider the word's frequency within the text, the term frequency tf. Term frequency alone is insufficient, however: a word that appears in many texts in a collection — for example, "的" (de), which is very frequent in Chinese text — does not help distinguish one text from another; such a word lacks discriminative power. The calculation should therefore also consider the word's document frequency df, i.e., the number of documents containing it. Since a word's weight should fall as its document frequency rises, the inverse document frequency is used: idf = log(N/n), where N is the total number of documents in the collection and n is the number of documents containing the word. The weight of a feature word t in document D is then weight(t,D) = tf(t,D) * idf(t). The tf*idf formula captures both a word's importance within a text and its discriminative power across texts: words with high tf*idf values are important in a document yet infrequent in other documents, making them good candidates for the feature words of a text vector.
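The tf*idf weighting above can be sketched as follows. This is a minimal illustration of the formula weight(t,D) = tf(t,D) * log(N/n), using raw term counts; the function name and data layout are my own choices, not from the original text.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Compute tf*idf weights for every word of every document.

    documents: list of tokenized documents (each a list of words).
    Returns one {word: weight} dict per document.
    """
    N = len(documents)
    # df: number of documents containing each word
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    weighted = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        # weight(t, D) = tf(t, D) * log(N / n), with n = df[t]
        weighted.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weighted
```

Note that a word occurring in every document gets idf = log(N/N) = 0, so its weight vanishes regardless of how often it appears — exactly the behavior wanted for undiscriminative words like "的".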
Once the feature words are selected, the vector representation of the text can be determined. With the text vectors, we can calculate the text similarity. There are many methods for calculating similarity, including:
the inner product method;
the cosine method;
the Dice coefficient method;
the Jaccard coefficient method.
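The four measures listed above can be sketched on weighted term vectors as below. This is a minimal sketch using the standard real-valued forms of these coefficients; it assumes the two texts have already been mapped to equal-length weight vectors over a shared feature-word set.

```python
import math

def inner_product(x, y):
    # plain dot product of the two weight vectors
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    # dot product normalized by vector lengths; 1 means angle 0 (most similar)
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return inner_product(x, y) / norm if norm else 0.0

def dice(x, y):
    # twice the dot product over the sum of squared lengths
    denom = sum(a * a for a in x) + sum(b * b for b in y)
    return 2 * inner_product(x, y) / denom if denom else 0.0

def jaccard(x, y):
    # dot product over (|x|^2 + |y|^2 - x.y)
    ip = inner_product(x, y)
    denom = sum(a * a for a in x) + sum(b * b for b in y) - ip
    return ip / denom if denom else 0.0
```

All four return larger values for more similar vectors; cosine, Dice, and Jaccard are normalized to at most 1 for identical vectors, while the raw inner product is not.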
2. Calculation method based on Hamming distance
The methods described above are based on vector space techniques, representing text as vectors in space and calculating the similarity between texts by measuring the angle between these vectors. However, the text similarity calculation method based on Hamming distance differs from the above methods. It is not based on vector space techniques but relies on Hamming distance from coding theory, calculating the similarity between two texts by measuring their Hamming distance. The advantage of this method is its simpler computation process.
First, let us introduce the Hamming distance from coding theory. The Hamming distance describes the distance between two codewords of length n. For codewords x=(x1x2…xi…xn) and y=(y1y2…yi…yn), it is calculated as follows:

D(x, y) = Σ (i = 1 to n) xi ⊕ yi
The operator ⊕ denotes modulo-2 addition, and xi and yi take values of 0 or 1. The value D(x,y) computed by this formula is the number of positions at which codewords x and y differ, and thus reflects the difference between them: the larger D(x,y), the lower the similarity between the two codewords.
When calculating text similarity with this method, relevant information such as keywords is first extracted and arranged into a code, so that each text corresponds one-to-one with a codeword. For example, text D can be represented as D=(10100111001101011), where each 0 or 1 indicates the state of a piece of text information at that position: if 0 means the information is absent from D, then 1 means it is present (the opposite convention works equally well). With texts represented as codewords of length n, the formula above yields a distance between 0 and n: a result of n means the information in the two texts differs at every position, while 0 means it is identical at every position. A raw distance, however, is not intuitive, and distances computed with different values of n cannot be compared. We therefore first fix the codeword scheme for the entire text collection, represent each text as its corresponding codeword, and normalize the distance. For texts D1=(x1x2…xi…xn) and D2=(y1y2…yi…yn), the similarity is defined as follows:

Sim(D1, D2) = 1 - (1/n) Σ (i = 1 to n) xi ⊕ yi
Here, xi and yi are the i-th components of the codewords for texts D1 and D2, taking values 0 or 1, and ⊕ is again modulo-2 addition, an operation computers perform very quickly. The similarity Sim(D1,D2) lies between 0 and 1: a result of 0 means the two texts are completely dissimilar, and a result of 1 means they are highly similar, which conforms to normal human cognitive patterns.
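The distance and similarity formulas above can be sketched as follows. This is a minimal sketch that assumes the texts have already been encoded as equal-length 0/1 codewords under a shared coding scheme; the encoding step itself, as the text notes, is the hard part and is not shown.

```python
def hamming_distance(x, y):
    """D(x, y) = number of positions where the codewords differ (xi XOR yi)."""
    if len(x) != len(y):
        raise ValueError("codewords must have equal length")
    return sum(a ^ b for a, b in zip(x, y))

def hamming_similarity(x, y):
    """Sim(D1, D2) = 1 - D(x, y) / n, a value in [0, 1]."""
    n = len(x)
    return 1 - hamming_distance(x, y) / n
```

Since each component is a single XOR, the whole computation is a linear pass over the codewords — markedly cheaper than the dot products and square roots of the vector-space measures.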
The Hamming-distance-based text similarity method avoids the more complex computations of vector-space techniques, relying instead on fast operations such as modulo-2 addition, so it computes quickly. It can also encode text information beyond isolated keywords, opening the possibility of describing a text by a combination of features. However, how to select and arrange text information into a codeword set that corresponds one-to-one with the texts remains a difficult problem requiring further research for this method.
3. Calculation methods based on semantic understanding
Text similarity calculation methods based on semantic understanding differ from statistics-based ones: they require neither large-scale corpora nor long training, but they do require a hierarchical semantic dictionary, computing similarity from the hypernym/hyponym and synonym relations between concepts. Since text similarity largely depends on the words that compose the text, these methods generally compute word similarity from the distance between two words in the semantic tree structure, which is why hierarchical semantic dictionaries such as WordNet, HowNet, and thesauri are commonly used. Many text similarity methods build on such dictionaries: some compute word similarity from the shortest path formed by hypernym/hyponym links in WordNet; others compute relatedness from the maximum information content of the two words' common ancestor node in the dictionary; and domestic methods use CNKI or a thesaurus to compute semantic similarity.
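The shortest-path idea above can be sketched on a toy hypernym hierarchy. This is an illustrative sketch only: the tiny tree, the `parents` encoding, and the 1/(1 + path length) decay are my own simplifications, not a real dictionary such as WordNet, which stores far richer structure.

```python
def path_length(parents, a, b):
    """Shortest path between two concepts in a hypernym (is-a) tree.

    parents: dict mapping each concept to its hypernym (root maps to None).
    """
    def ancestors(node):
        # depth of every ancestor, starting from the node itself at depth 0
        chain, depth = {}, 0
        while node is not None:
            chain[node] = depth
            node = parents.get(node)
            depth += 1
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    # the shortest path runs through the lowest common ancestor
    return min(up_a[c] + up_b[c] for c in up_a if c in up_b)

def word_similarity(parents, a, b):
    # a common convention: similarity decays with path length
    return 1 / (1 + path_length(parents, a, b))
```

On a tree where "dog" and "cat" share the close ancestor "animal" while "rose" hangs under "plant", the dog-cat path is shorter than the dog-rose path, so the similarity ordering matches intuition.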