Memory is a major component of both the brain and computers. In many areas of deep learning, we extend the capabilities of deep networks by pairing them with memory; for example, in question answering, we first memorize (store) pre-processed information and then use that information to answer questions.
We extend the functionality of neural networks by connecting them to external storage resources and interacting with these resources through memory processes.
Put simply, we create a memory structure, typically an array, into which we write data and from which we read it. Sounds simple, right? But it's not. First, we don't have unlimited storage to hold every image or sound we encounter; second, we access information through similarity or relevance (not necessarily an exact match). This article discusses how NTMs (Neural Turing Machines) store and retrieve information. Our interest in the NTM paper stems primarily from its status as a crucial starting point for many research areas, including NLP and meta-learning.
Memory structure
Our memory structure Mt contains N rows, each with M elements. Each row stores one piece of information (a memory), such as your description of your cousin.
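As a minimal sketch (the sizes here are illustrative, not taken from the paper), the memory is simply an N×M array:

```python
import numpy as np

# Memory matrix M_t: N rows (memories), each holding M elements.
# N and M are illustrative values, not from the NTM paper.
N, M = 128, 20
memory = np.zeros((N, M))  # one row per stored "memory"
```
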
Read
In traditional programming, we access memory by index, e.g. Mt[i]. In our setting, however, we retrieve information through similarity. Therefore, we introduce a weighted reading mechanism: the result we obtain is a weighted sum of the memory rows.
All the weights together sum to 1.
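The weighted read can be sketched in a few lines of NumPy (the memory contents and weights below are toy values for illustration):

```python
import numpy as np

def read(memory, w):
    """Weighted read: the result is the w-weighted sum of memory rows.
    w has one entry per row and sums to 1."""
    return w @ memory

# Toy memory: 3 rows of 2 elements each.
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
w = np.array([0.5, 0.25, 0.25])   # weights sum to 1
r = read(memory, w)               # 0.5*row0 + 0.25*row1 + 0.25*row2
```
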
You might immediately ask what the purpose of this is. Let's explain with an example. A friend hands you a drink that tastes a bit like tea and feels like milk. By retrieving your memories of tea and milk and combining them, you conclude that it's bubble tea. It sounds magical, but in word embeddings we use the same kind of linear algebra to capture relationships. In other applications, such as question answering, it's crucial to combine new information with accumulated knowledge. A memory network helps us achieve exactly that.
How do we create these weights? Of course, we rely on deep learning. The controller extracts a key feature kt from the input, and we use it to calculate the weights. For example, when you receive a phone call, you may not immediately recognize the other person's voice; it sounds like your cousin, but also a bit like your older brother. By combining these similarities, you might identify the caller as a high school classmate, even if the voice differs from what you remember.
To calculate the weight w, we compare the similarity between kt and each of our memory rows, computing a score K using cosine similarity.
Here, u is the extracted feature kt, v represents each row of our memory, and K(u, v) = (u · v) / (‖u‖ ‖v‖).
We apply the softmax function to the scores K to calculate the weights w. A scalar βt is included to amplify or attenuate the differences between scores; for example, if βt is greater than 1, the differences are amplified. Since w retrieves information based on similarity, this is called content addressing.
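Content addressing can be sketched as follows (the memory contents, key, and β values are illustrative):

```python
import numpy as np

def content_addressing(memory, k, beta):
    """Content addressing: cosine similarity between key k and each
    memory row, scaled by beta, then normalized with softmax."""
    K = memory @ k / (np.linalg.norm(memory, axis=1)
                      * np.linalg.norm(k) + 1e-8)
    scores = beta * K
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.7, 0.7]])
k = np.array([1.0, 0.0])                       # key from the controller
w_soft  = content_addressing(memory, k, beta=1.0)
w_sharp = content_addressing(memory, k, beta=50.0)  # large beta amplifies differences
```

With β = 50 the weighting becomes nearly one-hot on the most similar row, while β = 1 spreads attention more evenly.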
Write
How do we write information into memory? In an LSTM, the internal state of a memory cell is computed from its previous state and the current input. Similarly, writing to memory combines the previous state with new input. Here, we first erase part of the previous state:
`et` is an erase vector. (The computation is similar to the forget gate in an LSTM.)
Then, we write in the new information.
"at" is the value we want to add.
In short, through the parameters the controller generates (including w), we can write information into memory and read it back out.
Addressing mechanism
Our controller retrieves information by computing w, but similarity-based retrieval (content addressing) alone is not powerful enough.
Interpolation
'w' represents our current focus (attention) within memory. In content addressing, the focus is based only on the new input. However, this is insufficient for problems involving recent context. For example, if a classmate texted you an hour ago, you should be able to recall their voice easily. How do we make use of the previous attention when processing new input? We compute a merged weight from the current focus and the previous focus, using a gate g. Yes, this sounds somewhat like the forget gate in an LSTM or GRU.
g is computed from the previous focus and the current input.
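The interpolation step can be sketched as follows (the weightings and gate value are illustrative):

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """Blend the content-based focus with the previous focus.
    g in [0, 1] is emitted by the controller: g = 1 keeps only the
    new content-based focus, g = 0 keeps only the previous focus."""
    return g * w_content + (1.0 - g) * w_prev

w_content = np.array([0.9, 0.1, 0.0])   # focus from content addressing
w_prev    = np.array([0.0, 0.0, 1.0])   # focus from the previous step
w_g = interpolate(w_content, w_prev, g=0.5)   # [0.45, 0.05, 0.5]
```
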
Convolutional shift
The convolutional shift transforms the focus. It was not designed specifically for deep learning; rather, it is how the NTM performs basic algorithms like copying and sorting. For example, suppose we want to move each focus by 3 rows, that is, w[i] ← w[i+3].
With a convolutional shift, we can move the focus by a desired number of rows, and the shift can be soft, i.e. w[i] ← convolution(w[i+3], w[i+4], w[i+5]). The convolution is simply a linear weighted sum of rows, for example: 0.3×w[i+3] + 0.5×w[i+4] + 0.2×w[i+5].
This is the mathematical formula for the focus transformation, a circular convolution: w̃t(i) = Σj wgt(j) · st(i − j), where the index i − j is taken modulo N and st is the shift distribution emitted by the controller.
Many deep learning models skip this step, which is equivalent to setting s(i) = 0 for all i except s(0) = 1 (the identity shift).
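The circular convolution can be sketched as follows (the four-row weighting and shift kernel are toy values):

```python
import numpy as np

def shift(w, s):
    """Circular convolution over the weighting:
    w~(i) = sum_j w(j) * s(i - j), with i - j taken modulo N."""
    N = len(w)
    return np.array([sum(w[j] * s[(i - j) % N] for j in range(N))
                     for i in range(N)])

w = np.array([0.0, 1.0, 0.0, 0.0])  # focus currently on row 1
s = np.zeros(4)
s[1] = 1.0                          # put all mass on a shift of +1
w_shifted = shift(w, s)             # focus moves from row 1 to row 2
```

With s(0) = 1 and all other entries 0, the shift is the identity and the focus stays where it is, matching the note above.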
Sharpen
Our convolutional shift works like a convolutional blur filter. Therefore, when needed, we apply a sharpening step to the weights to counteract this blurring; γ is another parameter output by the controller for sharpening the focus.
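Sharpening raises each weight to the power γ (≥ 1) and renormalizes, which concentrates the focus again (the blurred weighting and γ below are illustrative):

```python
import numpy as np

def sharpen(w, gamma):
    """Sharpen a blurred weighting: raise each weight to gamma (>= 1)
    and renormalize, so the largest weight grows relative to the rest."""
    w_pow = w ** gamma
    return w_pow / w_pow.sum()

w_blurred = np.array([0.1, 0.6, 0.3])  # focus smeared by the shift
w_sharp = sharpen(w_blurred, gamma=3.0)
# the largest weight grows relative to the others
```
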
Summary
We retrieve information from memory using the weights w, which incorporate the current input, the previous focus, possible shifts, and sharpening. Below is a system block diagram in which the controller outputs the parameters needed to compute w at each stage.