One possible strategy for trading the financial markets is to use reinforcement learning algorithms and frame the
trading task as a "game", where the agent collects positive rewards for good trades and negative rewards for bad ones.
This particular personal project does a deep dive into common reinforcement learning strategies, before
applying them to the stock market.
Before diving straight into a trading environment, we will sanity check our implementation and tune the function approximator (neural network)
using the well-known CartPole game provided by OpenAI Gym/Gymnasium (fig. 1). The code used in this project can be found
here.
Included in the repository is a full set of Python unit tests which check the basic functionality of each of the implemented models to ensure
they work as intended.
We will be using the Q-learning algorithm (fig. 2), an off-policy algorithm (meaning it chooses actions in
a different manner from the policy the agent is trying to learn) that uses a temporal difference TD(0) update.
When a neural network is used to approximate the action-value function (usually denoted Q(s,a)), the resulting model is
commonly referred to as a DQN. The DQN can be augmented in certain ways to produce other variants which can excel under
certain circumstances, namely Double DQN, DynaQ, DynaQ+ and even dueling DQNs (the last of which will not be covered in this project).
We will be testing the efficacy of these models on the aforementioned environment, before eventually moving to predicting optimal trades.
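As a reference point, here is a minimal sketch of the tabular Q-learning TD(0) update; the DQN simply replaces the table with a neural network. The state/action sizes and hyperparameters below are placeholders, not values from the repository.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning TD(0) update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy bootstrap over next-state actions
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a table for 10 states and 2 actions, updated after one observed transition.
Q = np.zeros((10, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=0)
```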
The DQN itself is a very powerful model which was demonstrated to play many Atari games with good results, even achieving superhuman play
in some of them. The Q-learning algorithm can be applied to both tabular and continuous environments by estimating a value for each possible
state-action pair. The Q-learning update is mathematically proven to converge to the optimal policy under certain conditions, namely that
each state-action pair is visited an infinite number of times. However, as we can only train for finite time, the Q-learning agent
can sometimes get stuck in a local minimum, and one commonly known issue is overestimation. In fact, when performing a deep dive on the DQN using a
basic toy environment which always remains in the same state and has two possible actions, 0 and 1 - where action 1 always yields a reward of 1 and
action 0 always yields a reward of 0 - I observed that the estimated value for action 0 often gets overestimated during training, despite
the model never receiving a reward for that action. This is caused by the greedy term in the update, which takes the maximum of the value function over
actions in the next state.
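For concreteness, a minimal version of such a toy environment in the Gymnasium style might look like the following; the class name and details are illustrative rather than the exact code in the repository.

```python
import gymnasium as gym

class SingleStateEnv(gym.Env):
    """Toy environment with one state and two actions:
    action 1 always gives reward 1, action 0 always gives reward 0."""

    def __init__(self):
        self.observation_space = gym.spaces.Discrete(1)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return 0, {}                              # always the same (only) state

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return 0, reward, False, False, {}        # obs, reward, terminated, truncated, info
```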
Here we show the reward versus time plots for the DQN itself, and it can be seen that the model is highly unstable and struggles to reach
the optimal policy, even after 5000 episodes of training.
We can augment the DQN by duplicating the Q-network to form a policy network and a target network. The intuition behind this construction is to mitigate the overestimation problem and the "moving target" problem: by holding one network fixed while updating the other, the targets used during training remain stable, which makes training far less erratic. The fixed network's predictions will of course be poor initially, but it is updated periodically to match the network being trained. Additionally, we experiment with hard and soft updates, i.e. complete copies at fixed intervals versus small interpolations at every time step.
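A minimal PyTorch sketch of the two update schemes is shown below; the network architecture and the interpolation factor tau are placeholders, not the values used in the experiments.

```python
import copy
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(policy_net)        # frozen copy used to compute TD targets

def hard_update(target, policy):
    """Hard update: copy the policy weights into the target network every N steps."""
    target.load_state_dict(policy.state_dict())

def soft_update(target, policy, tau=0.005):
    """Soft update: nudge the target weights a small step towards the policy weights."""
    for t_param, p_param in zip(target.parameters(), policy.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)
```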
The performance of such agents also changes with the activation function used within the neural network, and we experiment with ReLU (used in all experiments above), GELU and Tanh. GELU exhibits the strongest performance, and is a strong alternative to ReLU in many cases, as its smooth, non-zero response for negative inputs helps prevent neurons from "dying out".
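Swapping the activation is a small change in the network definition; a sketch is below (the layer sizes are placeholders).

```python
import torch.nn as nn

def make_q_network(activation=nn.GELU):
    """Q-network factory; nn.ReLU, nn.GELU or nn.Tanh can be passed as the activation class."""
    return nn.Sequential(
        nn.Linear(4, 64), activation(),
        nn.Linear(64, 64), activation(),
        nn.Linear(64, 2),
    )
```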
Finally, we test the performance of the DynaDQN algorithm. This is another augmentation of the DQN algorithm, where this time we
simultaneously build an internal model of the environment which we can use to plan ahead. In other words, past observations are used
to train an internal world model which tries to learn - in this case - a mapping from (state, action) to (next state, reward). By doing this, we
can condense our past knowledge into a model that generates additional fabricated observations to supplement real ones. Every training step that uses
these generated observations is called a planning step, and in general this allows the DynaDQN to learn the optimal policy from fewer interactions with
the environment, which is useful when interactions with the world are costly. For the world model we use a random forest regressor.
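A rough sketch of this idea is shown below, assuming a replay buffer of (state, action, reward, next_state) tuples and a scikit-learn RandomForestRegressor as the world model; the helper names in the commented planning loop are hypothetical and the actual implementation in the repository may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class WorldModel:
    """Learns a mapping (state, action) -> (next_state, reward) from past transitions."""

    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=50)

    def fit(self, states, actions, next_states, rewards):
        X = np.hstack([states, actions.reshape(-1, 1)])
        y = np.hstack([next_states, rewards.reshape(-1, 1)])
        self.model.fit(X, y)

    def sample(self, state, action):
        pred = self.model.predict(np.hstack([state, [action]]).reshape(1, -1))[0]
        return pred[:-1], pred[-1]               # fabricated next_state and reward

# Planning loop (sketch): after every real environment step, the agent performs a few
# extra updates on transitions fabricated by the world model.
# for _ in range(n_planning_steps):
#     s, a = sample_seen_state_action(replay_buffer)     # hypothetical helper
#     s_next, r = world_model.sample(s, a)
#     agent.update(s, a, r, s_next)                      # same TD update as on real data
```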
Using 3 planning steps for each actual observation, we observe that DynaDQN training appears more stable and converges to the
optimal policy in fewer episodes than all previous models. Note that the "quality" of these fabricated observations is lower than that of actual observations
(due to inaccuracies in the world model), so it is important not to use too many planning steps.
The last model I implemented is the Dyna Q+ algorithm, which records the last time each action was chosen in a particular state and uses this to
encourage exploration of under-trialled actions over time. This is especially powerful for changing environments, as the algorithm is
more inclined to explore new actions if they haven't been used in a while. For this particular task, which uses a static environment, the extra exploration
provides no additional benefit. In fact, the extra exploration of potentially bad actions means that the algorithm takes more episodes to converge to the
optimal policy.
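A minimal sketch of the standard Dyna-Q+ exploration bonus is shown below: during planning, a bonus of kappa * sqrt(tau) is added to the reward, where tau is the number of steps since that state-action pair was last tried. The names and the value of kappa are illustrative; the repository's implementation may differ.

```python
import math
from collections import defaultdict

kappa = 1e-3                       # small constant scaling the exploration bonus
last_visited = defaultdict(int)    # (state, action) -> step at which it was last tried

def bonus_reward(reward, state, action, current_step):
    """Dyna-Q+ planning reward: the longer a pair has gone untried, the larger the bonus."""
    tau = current_step - last_visited[(state, action)]
    return reward + kappa * math.sqrt(tau)
```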
For trading stocks, the DQN algorithm was trialled using the gym trading environment. We omit any discussion of Dyna models in this section,
as it is well known that modelling stock data is a difficult problem, let alone predicting how each of the individual features transitions (modelling
state-to-state transitions).
These trading gym environments use the exact same methods as the regular
gym environments, but take a user-defined dataframe which contains price information for a particular stock. For this task, the data is preprocessed to
include some of the most common financial indicators, such as simple moving averages, exponential moving averages, the RSI and the MACD. These
are presented alongside price information to the agent, which returns 1 for buy, 0 for a neutral position and -1 for sell. Using the premade
gym environment makes it easy to train on multiple stocks. In this example we train a DQN on AAPL, AMZN and GOOG stock prices for the last
1000 days, or roughly three years. The results show that it is not trivial to use an RL agent to trade on regular price data: the quality of the data, its changing
distribution and the low signal-to-noise ratio make it very difficult to ascertain any concrete patterns. We can see that even as the model trains,
the performance (portfolio returns over the market) remains very low.
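As an illustration of the preprocessing step, the sketch below computes these indicators with pandas; it assumes a dataframe with a "close" column, and the window sizes are common defaults rather than the exact settings used in the project.

```python
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Append common technical indicators to a dataframe with a 'close' column."""
    close = df["close"]
    df["sma_20"] = close.rolling(20).mean()                     # simple moving average
    df["ema_20"] = close.ewm(span=20, adjust=False).mean()      # exponential moving average

    # RSI (14): ratio of average gains to average losses over the window.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"] = 100 - 100 / (1 + gain / loss)

    # MACD: difference of a fast and a slow EMA, plus its signal line.
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    df["macd"] = ema_12 - ema_26
    df["macd_signal"] = df["macd"].ewm(span=9, adjust=False).mean()
    return df.dropna()
```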
For this research project, I was partnered with the Huawei Neuromorphic Computing Group in Zurich and was granted use of the Huawei computing cluster. Workloads were managed using SLURM (the Simple Linux Utility for Resource Management). Additionally, in order to train models overnight or for long durations, it was essential to use tmux (a terminal multiplexer), which kept programs running regardless of whether the local machine was connected. In order to use the nodes to their full potential, it was important to become acquainted with these esoteric systems. Almost all models used in the experiments in this project were implemented by myself using PyTorch; public implementations were only used where necessary to verify my own implementation and results. This was a conscious choice to embrace every learning opportunity possible.
Abstract:
It has long been known that short-term memory is a key component of the human brain, used for learning skills for one-time or periodic use. This capability is also useful
in language modelling, as themes or topics of a corpus of text often appear and reappear throughout an article, motivating an element of forgetting
in language models that is not currently present. By utilizing ideas presented in recent works on the Short-Term Plasticity Neuron (STPN) and the Linearized Transformer,
we present a new language model, named the STPN-transformer (STPNt), which builds on the idea of the Linearized Transformer by augmenting it with learning to learn and forget.
Experimental results demonstrate the advantages of the STPNt over the Linear Transformer in limited memory settings on the Wikitext-2 dataset. Our research in this thesis begins
to demonstrate some of the potential advantages that this type of learning can provide in language modelling, to be built on in future works.
Contributions to the field:
Experimentation was conducted on several toy problems and two datasets: the novel War and Peace, and the language modelling dataset Wikitext-2. In the toy problem, the basic STPNt and STPNt with context were benchmarked against similar models in the literature, as well as the Transformer and Linear Transformer. The results presented in fig. 1 show that the STPNt outperforms the majority of models on both tasks, but performs worse on the toy problem designed to better mimic retrieval in a natural language context. The STPNt with context was introduced here, and showed superior performance to the basic STPNt. This discussion was extended to a language corpus, the War and Peace novel, which recapitulated the importance of context, with the contextual STPNt greatly outperforming the basic model. The experimentation on War and Peace also revealed another insight: stacking STPNt attention blocks caused training instabilities - instead we discovered that the optimal network architecture involved a single STPNt block stacked with Linear Transformer blocks. Finally, experimentation was conducted using the Wikitext-2 dataset, where we also introduced the final model variant: STPNt-LSTM.
During these experiments, we varied the training and evaluation sequence lengths, and compared the perplexity of the trained models on the test set, shown in figure 3. Surprisingly, the STPNt with context did not provide much benefit over the Linear Transformer, and in fact showed signs of overfitting; lower validation perplexity than the Linear Transformer but higher test perplexity. Nonetheless, the STPNt with context did outperform on sequence lengths beyond 256, and this could be an avenue for further work. The STPNt-LSTM yielded much more positive results: consistently lower training and validation perplexity compared to the Linear Transformer, and demonstrated the ability to utilize longer sequence lengths compared to the Linear Transformer and even the Transformer itself. In fact, the STPNt-LSTM was the only model to achieve a test perplexity improvement from increasing the sequence length from 64 to 128. Knowing that the LSTM cell endows the model with the ability to learn longer sequences, future work could explore the use of an even larger cell state, to see if this could improve things further.
Plastic Parameter Analysis:
We try to analyse what the STPNt-LSTM is learning by examining the plastic parameters (lambda, gamma) the model uses when evaluating a short sequence of text. We include this section to help improve the interpretability of the model,
in the same vein as work done by Karpathy et al. Using some sample text, we compare the lambda (which modulates the impact of previous information)
and the gamma (which modulates the impact of the current token) for each token in the sequence.
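As a rough illustration only (the exact parameterization used in the thesis may differ), a linearized-transformer fast-weight memory $S_t$ with learned plasticity of this kind would be updated along the lines of

$$ S_t = \lambda_t \odot S_{t-1} + \gamma_t \,(v_t k_t^{\top}), $$

where $k_t$ and $v_t$ are the key and value for the current token, $\lambda_t \approx 1$ corresponds to very slow forgetting of past associations, and $\gamma_t$ scales how strongly the current token is written into memory.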
Through analysing the results, shown in fig. 4, we see that the lambda parameters are in fact almost unused: for most timesteps the value of lambda is very close to one, indicating an incredibly slow decay rate
that allows for long context. The model does not reset the sequence at any point, but since the sequence is only 128 tokens long, additional analysis needs to be done on longer text, specifically across different topics and articles.
One token that does seem to encourage more forgetting of the past is the quotation mark, shown by the small dips in the lambda graph. What is more surprising is the model's use of the gamma parameters. Here we see that the tokens the model retains
most strongly correspond to tokens such as full stops, "in" and "as", which appear throughout general text and provide little information about the subsequent tokens. This may be due to the small size of the Wikitext-2 dataset: the model
may not have encountered some words, such as names like "Josh", many times, and thus encodes little information in the embeddings of those words, rendering them less useful. However, this is just conjecture, and more work must be done, training on
larger corpora, such as Wikitext or the One Billion Word corpus.
We finish our summary by noting several avenues for the continuation of the work in this thesis. Firstly, the models used in the experiments are relatively small compared to those in the literature, and the performance statistics listed
are far from state-of-the-art, even among non-pre-trained models. Thus, one important next step would be to scale up our models, preferably to around 24M trainable parameters, which is comparable to the literature, and incorporate
any important regularization techniques. This will be crucial for determining whether our model can truly deliver a benefit. Furthermore, it was discovered upon examining our model's parameters during evaluation that the STPNt-LSTM is
learning some counter-intuitive properties of the data, and further work should be done training it on larger language corpora to explore the consequences of this. Finally, by changing our embedding scheme to a relative one, we could
utilize a "cache" during evaluation time, and this could have interesting effects on the STPNt's performance.
The Login App is the first "proper" purely personal project I have taken on. The main objective of this
project was to improve my ability to write object-oriented code. Starting off, I had a basic idea of what I wanted to build,
knowing that along the way I would find more and more features to add and improve on. I chose to use Python, as it is my
strongest and most used programming language thus far.
I wrote all of the code (which can be found here) using the
Agile development method - a snippet of my sprint logs is shown below.
The GUI for my app was created using Tkinter, a framework built into Python's standard library.
Tkinter sometimes gives the UI a rather outdated/clunky look; however, it is easy for new users to pick
up in a short time. The main page, a Tk() object, is the login page (shown below), where existing users can log in. The other pages used by
the app open as Toplevel() objects that "stem" from the login page.
If you are not an existing user, there is an option to register an account with the app. After choosing this option, the user will be asked
to provide several details: a new username, password and email address.
Upon successful registration, an email will be sent to the provided address, and a success message will be displayed to the user.
The user can now proceed to log in to the app.
If the username and password both match an account on the system, the app will then prompt the user for a 6-digit code, sent to the
email address they signed up with (the 2FA process is described in the next section).
If the user enters the code correctly, they will reach the landing page of the login app. Here, their username and email address will
be displayed. The background is a reference to the TV show Mr. Robot.
This section covers the two-factor authentication (2FA) used within the app and the basic security
measures used to make the app more secure and to protect users. The entire process is very simple: it starts by concatenating 6 random
digits into a code and sending this to the user's email address, whilst simultaneously prompting the user to enter the aforementioned code.
Then, the two can simply be compared, resulting in a successful login if they are the same and rejecting the sign-in otherwise. In order to add
2FA to the app, discovery work was done into sending emails using the built-in Python library smtplib. In particular, I used the class
smtplib.SMTP_SSL(); the code for the mail-sending function is shown below.
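As a minimal sketch of what such a mail-sending function can look like with smtplib.SMTP_SSL() (the addresses, credentials and names here are placeholders, not the app's actual code):

```python
import random
import smtplib
from email.message import EmailMessage

def send_2fa_code(recipient: str, sender: str, app_password: str) -> str:
    """Generate a 6-digit code and email it to the recipient over an SSL connection."""
    code = "".join(str(random.randint(0, 9)) for _ in range(6))

    msg = EmailMessage()
    msg["Subject"] = "Your login code"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Your 6-digit verification code is: {code}")

    # SMTP_SSL opens an encrypted connection to Google's SMTP server on port 465.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, app_password)
        server.send_message(msg)
    return code    # later compared against the user's input to complete 2FA
```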
I will now briefly describe the process carried out by this code.
A link is established to a Simple Mail Transfer Protocol (SMTP) server; the specific one used here is owned by Google, with the
address "smtp.gmail.com". After calling the .login() method with correct credentials for a valid Gmail account, an email can be sent.
The email and other details, such as the recipient's address, are sent over to the host address corresponding to the SMTP server.
The SMTP server then contacts a DNS server to find the address of the recipient's mail server, and finally forwards the message
to the recipient's Mail Transfer Agent (MTA) server. Then, using either the POP or IMAP protocol (depending on the recipient's email client), the email
arrives in the recipient's inbox. The whole process is carried out over the Secure Sockets Layer (SSL) protocol, which creates an encrypted
connection, making sure that in the unlikely case the email is intercepted midway, attackers cannot read the information within it.
Aside from 2FA, the app requires users to create a password that adheres to standard requirements:
If the user fails to meet these requirements, the app will reject the password and prompt the user to strengthen the password.
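As a rough illustration, a check of this kind could look like the sketch below; the exact rules enforced by the app are the ones listed above, and the thresholds here are placeholders.

```python
import re

def is_strong_password(password: str) -> bool:
    """Illustrative check: minimum length, upper and lower case, a digit and a symbol."""
    return (
        len(password) >= 8
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"[a-z]", password) is not None
        and re.search(r"\d", password) is not None
        and re.search(r"[^A-Za-z0-9]", password) is not None
    )
```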
Additionally, passwords stored in the database are hashed using SHA-256. This adds extra steps for a hacker who gains access
to the database: to brute-force a password, they would need to apply the hash on every attempt in order to find the correct
one, which is far more time consuming. The algorithm is described in more detail in the next section.
This section will briefly go over the Secure Hash Algorithm (SHA-256), implemented from scratch and
used to hash passwords in the Login app. The information in this section can be found in its entirety
here, in the pseudocode section.
Below, you can see the main method implemented in my class, SHA2, called .encrypt().
Walking through the steps within this method: first there is a method called binary conversion, which simply takes the password string
and converts each character to binary. Next, the binary string is "padded" with a single 1 bit, zeros and the message length (also in binary) so that
we end up with a 512-bit block.
The next two steps are closely related, so they shall be described together. In SHA-256 hashing there are 64 so-called W values, which
are used in "rounds". The first 16 W values, labelled 0 to 15, are created from the initial message by splitting the 512 bits into
16 x 32-bit chunks. Once these have been obtained, the algorithm iterates a further 48 times (64 minus 16) to "generate" the rest
of the W values, using the formula below:
This formula uses functions for right rotate (a circular shift of the bits) and right shift (dropping the lowest bits and padding with zeros),
applied to existing W values, to compute each new W value. After this is completed, the 64 "rounds" commence.
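For reference, the standard SHA-256 message-schedule expansion corresponding to this formula looks like the following; the helper names are illustrative, and the class in the repository may structure this differently.

```python
MASK = 0xFFFFFFFF   # keep every value within 32 bits

def rotr(x, n):
    """Right rotate a 32-bit word by n bits (the bits wrap around)."""
    return ((x >> n) | (x << (32 - n))) & MASK

def extend_schedule(w):
    """Extend the first 16 32-bit words to the full 64-entry message schedule."""
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w
```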
The diagram of a single round is shown below.
A to H at the top represent "working variables". Using a number of different functions - maj, conditional (choose) and the Sigma functions - combined with our W values
and the K values (derived from the cube roots of the first 64 prime numbers) via modular addition, these working variables are used to initialize
the working variables of the next round. This is repeated 64 times to retrieve the final set of working variables. Adding these to the initial
working variables, converting to hexadecimal and finally concatenating, we retrieve the SHA-256 hash.
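A single round of the standard compression loop, written in the same illustrative style (it reuses rotr and MASK from the sketch above; k_i and w_i are one round constant and one schedule word):

```python
def compression_round(state, k_i, w_i):
    """One SHA-256 round: mix the working variables a..h with one K and one W value."""
    a, b, c, d, e, f, g, h = state
    S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)                       # the "conditional" (choose) function
    temp1 = (h + S1 + ch + k_i + w_i) & MASK
    S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)             # the "majority" function
    temp2 = (S0 + maj) & MASK
    # Shift every variable down one place; a and e receive the new mixed values.
    return ((temp1 + temp2) & MASK, a, b, c, (d + temp1) & MASK, e, f, g)
```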
Unit tests were written to check the validity of the hash function.
SQLite3, built into the standard Python library, was used to connect to, create and maintain the database for the Login app. Queries to the database were written in SQL.
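A minimal sketch of the kind of query involved is shown below; the table and column names are illustrative, not necessarily the app's actual schema.

```python
import sqlite3

conn = sqlite3.connect("login_app.db")
cur = conn.cursor()

# Passwords are stored as SHA-256 hashes, never in plain text.
cur.execute(
    "CREATE TABLE IF NOT EXISTS users ("
    "username TEXT PRIMARY KEY, email TEXT NOT NULL, password_hash TEXT NOT NULL)"
)

# Parameterised queries keep user input separate from the SQL itself.
cur.execute(
    "INSERT OR IGNORE INTO users VALUES (?, ?, ?)",
    ("example_user", "user@example.com", "<sha-256 hash>"),
)
conn.commit()
conn.close()
```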
This project was completed as part of an assignment for one of my final year modules, "Machine Learning
for Physicists". The final grade achieved for this module was 82%. The objective of this project was to explore methods for classifying
detector images. Traditionally, detector images are pored over by teams of scientists at particle colliders in order to determine what
useful information can be extracted. However, as technology has improved, detectors now spit out thousands, if not millions, of images per day,
making it unfeasible to check each image individually. This is where machine learning can come in handy.
My exploration of the different tools available to handle this task was documented and presented in the form of a scientific paper. I have
attached the final paper below for you to peruse:
Coming soon