Paper deep dive
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau
Models: GPT-J-6B
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 7:55:13 PM
Summary
The paper introduces 'Future Lens', a method to extract and visualize information about future tokens (beyond the immediate next token) encoded within a single hidden state of a transformer model (GPT-J-6B). The authors demonstrate that individual hidden states contain enough signal to predict subsequent tokens with substantial accuracy, using three approaches: linear approximation, fixed-prompt causal intervention, and learned-prompt causal intervention.
Entities (4)
Relation Signals (3)
GPT-J-6B → trained_on → The Pile
confidence 100% · We perform evaluation on samples of the Pile (Gao et al., 2020), which is the 825GB dataset used to train GPT-J-6B
Learned Prompt Causal Intervention → evaluated_on → GPT-J-6B
confidence 95% · We perform evaluation on samples of the Pile... used to train GPT-J-6B... we next explore an approach in which we learn to surface information... from individual contextual token embeddings.
Future Lens → utilizes → Learned Prompt Causal Intervention
confidence 95% · We apply Learned Prompt Intervention to create a novel probing tool we call the Future Lens.
Cypher Suggestions (2)
Identify datasets associated with the model · confidence 95% · unvalidated
MATCH (llm:LLM {name: 'GPT-J-6B'})-[:TRAINED_ON]->(d:Dataset) RETURN d.name
Find all methods used to analyze the LLM · confidence 90% · unvalidated
MATCH (m:Method)-[:EVALUATED_ON]->(llm:LLM {name: 'GPT-J-6B'}) RETURN m.name
Abstract
We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally, we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
Tags
Links
- Source: https://arxiv.org/abs/2311.04897
- Canonical: https://arxiv.org/abs/2311.04897
- Code: https://future.baulab.info
Full Text
42,760 characters extracted from source content.
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

Koyena Pal (Northeastern University, pal.k@northeastern.edu), Jiuding Sun (Northeastern University, sun.jiu@northeastern.edu), Andrew Yuan (UMass Amherst, awyuan@umass.edu), Byron C. Wallace (Northeastern University, b.wallace@northeastern.edu), David Bau (Northeastern University, d.bau@northeastern.edu)

Abstract

We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally, we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.

1 Introduction

Do hidden states in large language models (LLMs) encode tokens farther than a single token ahead? If so, how can we decode this sequence of tokens from a single state? In this work we empirically investigate these questions using GPT-J-6B (Wang and Komatsuzaki, 2021). We train models to predict hidden states several tokens ahead of a given position $t$ based only on a contextualized representation of the input at this position.

Auto-regressive transformer language models are typically trained to predict one token ahead, but recent work has hinted that individual hidden states may contain more information than just probabilities of the following token. For example, Meng et al. (2022a) trace information flow from subject tokens to associated attribute predictions many steps ahead. Elsewhere, Gurnee et al. (2023) suggest that neurons in early layers are dense with information, while middle layers have dedicated neurons that represent high-level contextual features.

Other related efforts have passed hidden intermediate states directly to the decoder head (skipping in-between layers) to "verbalize" such embeddings (Din et al., 2023; Belrose et al., 2023; nostalgebraist, 2020). Studies of memorization (Carlini et al., 2021, 2023, 2019) have identified the presence of very long memorized sequences generated by language models, and Zhang and He (2020) show that progressively dropping layers during computation can still produce predictions similar to those of the fully computed model run.

In this work we ask: To what extent can we extract information about future (beyond subsequent) tokens from a single hidden token representation? To answer this, we conduct three experiments. First, extending the ideas of the Tuned Lens (Belrose et al., 2023; Din et al., 2023) and the Logit Lens (nostalgebraist, 2020), we train linear models to approximate future model predictions several tokens in the future, in order to reveal the extent to which individual hidden states may directly encode subsequent tokens.
Second, we perform a causal intervention study in which we transplant individual hidden states from one context to a completely different context and measure the extent to which future tokens that were predicted in the original context can be predicted in the foreign context. Finally, we fit a "soft prompt" to explicitly learn an optimal prompt that permits reading out information about subsequent tokens from a hidden state. Our code and data are available at https://future.baulab.info.

2 Methods

To unveil the information about "future" tokens implicitly encoded in a single transformer state vector, we develop and compare several methods for predicting future tokens from a single hidden state. Each of our methods has the same goal: extract accurate predictions of a model's probability distribution several tokens ahead, based on the information in only one hidden state, at a single layer, at one token of the transformer.

For our evaluations we use an autoregressive transformer (Vaswani et al., 2017) language model defined as a function $G: X \to Y$ over a vocabulary $V$ of size $|V| = d_v$. $G$ takes in a sequence of tokens $x = [x_1, \ldots, x_T] \in X$, $x_i \in V$, and maps this to a probability distribution $y_T \in Y \subset [0,1]^{d_v}$, which (greedily) predicts the next token $x_{T+1} = \arg\max y_T$. To generate additional tokens, the top predicted token $x_{T+1}$ is added to the sequence of tokens $[x_1, \ldots, x_T, x_{T+1}]$ and the process is repeated until the next $N$ tokens are produced.

To calculate each predicted probability distribution from an input sequence $x$, the transformer performs a sequence of computations at $L$ layers; this can be decomposed as:

$G(x) = D(b_L(\cdots b_2(b_1(E(x))) \cdots))$  (1)

where the first step $E$ embeds each input token into an initial hidden representation, $e(x_i) = h^0_i \in \mathbb{R}^{d_h}$; each layer $b_l: \mathbb{R}^{d_h \times T} \to \mathbb{R}^{d_h \times T}$ transforms the sequence of representations; and the decoder $D: \mathbb{R}^{d_h} \to Y$ decodes the predicted probability distribution $y_T = D(h^L_T)$ from the last layer at the last token. We write the output of layer $l$ as $H^l = b_l(H^{l-1})$, where:

$H^l = (h^l_1, \ldots, h^l_T) \in \mathbb{R}^{d_h \times T}$  (2)

When generating a sequence of tokens beyond the given starting prefix of length $T$, we write:

$y_{T+i} = G([x_1, \ldots, x_{T+i-1}, x_{T+i}])$  (3)

$x_{T+i+1} = \arg\max y_{T+i}$  (4)

Our goal is to devise methods that can anticipate what $G$ will predict for $y_{T+1}$ through $y_{T+N}$ from only a single hidden state $h^l_T$.

2.1 Direct Vocabulary Prediction

Let $h^l_T$ denote the hidden representation induced by $G$ for token $x_T$ at intermediate layer $l \leq L$, and let $y_{T+N}$ denote the subsequent-token distribution predicted by $G$ after token $x_{T+N}$. To predict $y_{T+N}$ from $h^l_T$ alone, we train a linear model $g_\theta$ to predict logits $\hat{z}_{T+N}$ that approximate $y_{T+N}$ after softmax:

$\hat{z}_{T+N} = g_\theta(h^l_T)$  (5)

$\hat{y}_{T+N} = \mathrm{softmax}(\hat{z}_{T+N}) \approx y_{T+N}$

Since this model directly predicts the subsequent predictions over the full vocabulary from $h^l_T$, we call it the direct vocabulary prediction model.

Figure 1: LLM to Linear Model Approximation Overview. Given a hidden state $h^l_T$, the linear model $f_\theta$ is trained to output a future hidden state $h^L_{T+1}$. In this example, $h^l_T$ is the encoding that would lead to the prediction of "New," and $f_\theta$ uses only that information to predict the $h^L_{T+1}$ that would predict "York."

2.2 Linear Model Approximation

We also test a linear model based on the tuned logit lens (Belrose et al., 2023; Din et al., 2023) approach, which anticipates future hidden states within the transformer and decodes them using the pretrained decoder head.
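To make the two probe variants concrete, here is a minimal PyTorch sketch, assuming GPT-J-6B's reported dimensions (hidden size 4096, vocabulary 50,400) and, for the hidden-state variant detailed in §2.2 below, that the Hugging Face GPT-J checkpoint exposes the final layer norm as `transformer.ln_f` and the decoder head as `lm_head`. The KL and MSE objectives are illustrative choices of ours; the excerpt above does not pin down the exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectVocabProbe(nn.Module):
    """Sec. 2.1: g_theta maps one hidden state h^l_T directly to logits over the
    full vocabulary, approximating the model's own distribution y_{T+N}."""
    def __init__(self, d_hidden: int = 4096, vocab_size: int = 50400):
        super().__init__()
        self.proj = nn.Linear(d_hidden, vocab_size)

    def forward(self, h):                      # h: (batch, d_hidden)
        return self.proj(h)                    # logits z_hat_{T+N}

class HiddenStateProbe(nn.Module):
    """Sec. 2.2: f_theta maps h^l_T to an estimate of the final-layer state
    h^L_{T+N}; the vocabulary is then read out with the *pretrained* decoder."""
    def __init__(self, d_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(d_hidden, d_hidden)

    def forward(self, h):
        return self.proj(h)                    # h_hat^L_{T+N}

def decode_with_pretrained_head(model, h_hat):
    """Reuse GPT-J's frozen decoder. Applying ln_f before lm_head mirrors the
    usual GPT-J decoding path (our assumption, not spelled out in the paper)."""
    with torch.no_grad():
        return model.lm_head(model.transformer.ln_f(h_hat)).softmax(-1)

# Illustrative training objectives (our choices, not the authors' stated losses):
#   vocab probe : KL between softmax(g_theta(h)) and GPT-J's y_{T+N}
#   hidden probe: MSE between f_theta(h) and the true h^L_{T+N}
def vocab_probe_loss(probe, h_lT, target_probs):
    return F.kl_div(F.log_softmax(probe(h_lT), dim=-1), target_probs,
                    reduction="batchmean")

def hidden_probe_loss(probe, h_lT, h_L_target):
    return F.mse_loss(probe(h_lT), h_L_target)
```

In practice one such probe would be trained per layer and per future offset N, with hidden states and target distributions harvested from frozen GPT-J runs over the Pile, as described in the experimental setup later in the paper.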
Differently from that work, we model hidden states at future tokens rather than only at later layers. Beginning with the hidden representation $h^l_T$, we create a model to predict a hidden state $h^L_{T+N}$ at the final layer $L$, corresponding to the subsequent token $x_{T+N}$. To predict $h^L_{T+N}$ from $h^l_T$, we train a linear model:

$\hat{h}^L_{T+N} = f_\theta(h^l_T) \approx h^L_{T+N}$  (6)

The vocabulary can be read from the predicted $\hat{h}^L_{T+N}$ by applying the pretrained decoder head of the transformer. In Figure 1, we show an example of one such linear model. Suppose that we have trained a linear model $f_\theta$, parameterized by $\theta$, that takes in the last-token hidden representation of the input at layer $l$ and generates a hidden state at layer $L$ for the following token. When we feed "Madison Square Garden is located in" to $G$, we get "New" as the highest-probability prediction at N = 0 and "York" at N = 1. We use the linear model to approximate this based on the hidden representation of the last context token (i.e., "in") at layer $l \leq L$ as our input; the ideal output of the linear model would be the hidden state at the following token position and layer $L$, which is associated with predicting "York" as the most probable token.

This approach differs from the direct vocabulary approach by reusing the pretrained decoder head of the transformer. We find that this marginally aids predictions at the latest layers $l$ near $L$. Based on the observation that other pretrained transformer parameters may encode memorized calculations that facilitate decoding of subsequent tokens, we next turn to other approaches that utilize larger portions of the pretrained transformer to predict future tokens.

2.3 Fixed Prompt Causal Intervention

The next method we consider involves a single-state causal intervention where we transplant the hidden state $h^l_T$ into the transformer while it is decoding an unrelated bit of context. The question is whether this transplantation steers the model to generate tokens related to the prefix that induced $h^l_T$. If it does, this indicates that information about subsequent tokens (in the original sequence) is prominently encoded in $h^l_T$.

Figure 2 depicts the procedure. On the left, we show the original context from which $h^l_T$ is read; here $x = [x_1, \ldots, x_T]$ is "Madison Square Garden is located in", where $x_1$ is "Madison" and $x_T$ is "in". This results in a sequence of outputs $[x_{T+1}, \ldots, x_{T+N}]$ which will read "New York City." On the right, we run a single generic fixed-context prompt $c = [c_1, \ldots, c_M]$ (e.g., "Please, tell me something about", where $c_1$ is "Please" and $c_M$ is "about") through the transformer. One would not anticipate that this generic prompt would cause the transformer to predict "New York City".

Using an intervention, we now directly test the hypothesis that a single hidden state at layer $l$ and token $T$ within the original run contains the information necessary to predict subsequent tokens. We transplant the original run's state vector $h^l_T$ into the corresponding location $h^l_M$ in the fixed-context run, then allow the transformer to proceed. If the necessary contextual information is present in the new run, the resulting generated tokens would become "New" for the current token generation and "York" and "City" for the subsequent token generations.
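A sketch of the transplant operation just described, using a PyTorch forward hook. The module path `model.transformer.h[layer]` and the tuple-shaped block output follow the Hugging Face GPT-J implementation as we understand it; treat the whole function as an illustration under those assumptions rather than the authors' released code.

```python
import torch

def transplant_generate(model, tokenizer, h_lT, layer, fixed_prompt,
                        n_new_tokens=3):
    """Fixed-prompt causal intervention (sketch): overwrite the hidden state of
    the last token of a generic context at layer `layer` with h^l_T taken from
    the original run, then greedily decode a few tokens and see whether the
    original continuation ("New York City") re-emerges."""
    ids = tokenizer(fixed_prompt, return_tensors="pt").input_ids
    patch_pos = ids.shape[1] - 1               # position of h^l_M (last context token)

    def hook(module, inputs, output):
        hidden = output[0].clone()
        if hidden.shape[1] > patch_pos:        # only on passes that include h^l_M
            hidden[:, patch_pos, :] = h_lT     # the intervention h^l_M := h^l_T
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        out = model.generate(ids, max_new_tokens=n_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tokenizer.decode(out[0, ids.shape[1]:])
```

If the grafted state still steers a generic context such as "Please, tell me something about" toward the original continuation, that is the positive signal this section looks for.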
Formally, let the sequence $x = [x_1, \ldots, x_T]$ denote an input context that causes the model to subsequently generate $[x_{T+1}, \ldots, x_{T+N}]$, and let $c = [c_1, \ldots, c_M]$ represent a generic fixed-context prompt, where $T$ and $M$ represent the lengths of the original and fixed input prompts, respectively. When each is passed through $G$, we get the following predicted distributions:

$y_T = G(x) \in [0,1]^{|V|}$  (7)

$\hat{y}^{*}_M = G(c) \in [0,1]^{|V|}$

Denote the intervention that replaces $h^l_M$ from the fixed-context run with state $h^l_T$ from the original run as:

$\hat{y}_M = G(c \,\|\, h^l_M := h^l_T)$  (8)

If, after the intervention, the new predicted distribution $\hat{y}_M$ approximates the prediction in the original context, that will reveal that $h^l_T$ specifically encodes the information needed for that prediction. Furthermore, we can deduce what $h^l_T$ encodes about subsequent token predictions $n$ steps ahead by adding the generated tokens to the input and comparing the following predictions:

$y_{T+i} = G(x + [x_{T+1}, \ldots, x_{T+N}])$  (9)

$\hat{y}_{M+i} = G(c + [x_{T+1}, \ldots, x_{T+N}] \,\|\, h^l_M := h^l_T)$

The context prompt $c$ could be chosen as any sequence of tokens. In practice, some prompts are more amenable to this intervention than others. In our experiments, we test a small set of highly generic phrases.

Figure 2: Illustration of Fixed Prompt Causal Intervention. The left and right sides represent two different transformer model runs. On the left-hand side, we have the original run of "Madison Square Garden ... in New York." We transplant the hidden state $h^l_T$ to the other transformer run, which has a fixed generic context, "Tell me something about", as its input. With $h^l_T$ replacing the hidden state at $h^l_M$, we measure the tendency of this modified transformer run to reveal the probability distribution in $h^l_T$. In such cases, it would reveal that $h^l_T$ was predicting, for instance, "New York City."

2.4 Learned Prompt Causal Intervention

In the previous section, we described an intervention that can reveal information predictive of upcoming tokens encoded in a single hidden state, by steering generation when the state is grafted into a completely unrelated context. However, in cases where this "fails", it does not necessarily mean that the hidden state does not encode similar information; it may just be less prominent. To evaluate the degree to which such signal is present in these cases, we next explore an approach in which we learn to surface information about subsequent tokens from individual contextual token embeddings. This procedure is shown in Figure 3.

Figure 3: Learned Context Prompt Causal Intervention Overview. The left and right sides represent two different transformer model runs. The general setup is the same as in Figure 2. The difference lies in the context provided in the transformer run on the right-hand side. Instead of manually devising a context, we provide a learned context to increase the tendency of decoding the subsequent tokens predicted by $h^l_T$. We do so by training the context $c$ with an $L_{KL}$ criterion and the objective of matching the subsequent token prediction, such as "York" in this instance.

Specifically, we optimize a parameterized prefix $c_{\mathrm{opt}} = [c_1, \ldots, c_M]$ to extract this information from the hidden state. For each decoder layer $l$, we train the corresponding prefix $c^{(l)}_{\mathrm{opt}} = [c^{(l)}_1, \ldots, c^{(l)}_M]$ to maximize the probability of the model yielding the exact subsequent phrase after the original context. In particular, we conduct the same causal intervention on the hidden state $h^l_T$.
We then optimize the probability distribution of the subsequent generation under the learned context to be the same as that of the original model when all of its previous generation is given correctly:

$\arg\min_{c} \ \mathrm{KL}(\hat{y}_{M+N}\,;\,y_{T+N})$  (10)

where the predicted distribution $\hat{y}_{M+N}$ is given using the same intervention as described in Eq. 9:

$\hat{y}_{M+N} = G([c_1, \ldots, c_M, x_{T+1}, \ldots, x_{T+N}] \,\|\, h^l_M := h^l_T)$  (11)

We hence optimize this objective with the model frozen and only the prefix left to be trained. Notably, our approach differs from the implementation of prefix tuning (Li and Liang, 2021) in that we back-propagate the gradient through the model instead of a temporary MLP, as empirically this produces a significantly better optimized context.

3 Experiments and Results

3.1 Data

We perform evaluation on samples of the Pile (Gao et al., 2020), which is the 825GB dataset used to train GPT-J-6B (Wang and Komatsuzaki, 2021) as well as other LLMs. To train the linear models, we sample 100,000 tokens that have an average context length of 518. Amongst the 100,000 token samples, we use 10,000 to train for our learned prompt experiment. For testing our methods, we sample another 1,000 tokens that have an average previous context length of 535. To simplify our analysis of the degree to which single hidden token representations encode subsequent n-grams, we draw our samples from contexts in which the original transformer model made a correct prediction. More specifically, we randomly sampled train and test data points from the subset of token locations where the autoregressive transformer under consideration correctly predicts the following token. In Table 1, we break down the types of tokens present in the testing data by categorizing the last token ($x_T$) of the prefix as well as the generated outputs of GPT-J, obtained through greedy (argmax) decoding, at N = 0, 1, 2, 3 with respect to various properties, such as whether they are lower-cased tokens that start with a space, or are numerical tokens, and so on.

3.2 Evaluation Metrics

For evaluation we adopt the same metrics used in prior related work (Din et al., 2023), namely Precision@k and Surprisal.

Precision@k measures whether the top-probability token we predict from the hidden state, N tokens ahead, appears among the observed top-k tokens of the GPT-J-6B model output. Higher values are better here, because they mean the actual token at the corresponding future position was accurately predicted.

Surprisal is the negative log probability, according to the GPT-J-6B model output, of the highest-probability token according to the proposed probing method. Lower is better for this measure, because such values imply that the top predicted tokens are deemed probable by the model.
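A small sketch of these two metrics as we read them; `probe_probs` and `model_probs` are hypothetical names for the probing method's and GPT-J-6B's output distributions at the same future position.

```python
import torch

def precision_at_k(probe_probs, model_probs, k=1):
    """1.0 if the probe's top-1 token is among GPT-J's top-k tokens at the
    target future position, else 0.0 (averaged over the test set in practice)."""
    top_probe = probe_probs.argmax()
    return float((model_probs.topk(k).indices == top_probe).any())

def surprisal(probe_probs, model_probs, eps=1e-12):
    """Negative log probability, under GPT-J's output distribution, of the
    token the probe ranks highest; lower is better."""
    return float(-torch.log(model_probs[probe_probs.argmax()] + eps))
```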
3.3 Experimental Setup

Linear Model: We train two types of linear models: one with an output space of 4096 (the hidden representation size used by GPT-J-6B), and one with an output space of 50,400 (the vocabulary size of the same model). GPT-J-6B comprises 28 layers. We train 4 instances for each of these layers, one for each "future" token position we consider (N = 0, 1, 2, 3). As input we take the source hidden state, i.e., $h^l_T$. The output is either the hidden state, i.e., $h^L_{T+N}$, or the decoded output (vocabulary distribution) at position $T+N$.

Fixed Prompt Causal Intervention: This is an evaluation-only setup where we choose four generic context prompts and perform causal intervention on these contexts as shown in Figure 2. The four fixed context prompts that we test are:

- Hello! Could you please tell me more about "
- The multi-tokens present here are "
- The concepts in this hidden state listed are: (
- <|endoftext|> This state is describing about the following concept:

The hidden states are gathered from layer $l$ at the last token of the context and are transplanted into the hidden representation of the last token of the generic prompt at the same layer $l$.

| Property | Last original context token | N = 0 | N = 1 | N = 2 | N = 3 | Examples |
| --- | --- | --- | --- | --- | --- | --- |
| Lowercase, no space | 12 | 14.5 | 18.1 | 13.1 | 13.4 | 'itability', 'aka', 'ension' |
| Lowercase, with space | 42 | 39.1 | 37.1 | 38.4 | 36.7 | ' sense', ' tests', ' punitive' |
| Uppercase, no space | 2.4 | 2.7 | 2.2 | 2.8 | 1.6 | 'V', 'TABLE', 'SE' |
| Uppercase, with space | 1.9 | 2.4 | 1.1 | 1.5 | 1.7 | ' STAR', ' UK', ' USA' |
| Token length < 4 | 57.8 | 59.8 | 64.3 | 59.9 | 63.2 | '*', 'ate', '</' |
| Token length ≥ 4 | 42.2 | 40.2 | 35.9 | 40.5 | 37 | ' validation', ' Subaru', 'ulsion' |
| Punctuation | 15.7 | 14.5 | 17.3 | 15.2 | 19 | '-', '.', '</' |
| Numerical | 2.4 | 2.7 | 1.9 | 3.2 | 2.8 | '1998', '001', '5' |

Table 1: Frequency of different token properties for the last prefix tokens and GPT-J outputs at N = 0, 1, 2, 3. Each number in the table is a percentage of the test dataset, which is of size 1,000.

Figure 4 (panels N = 0 through N = 3; x-axis: in-layer; y-axis: Precision@1; series: learned prompt, fixed prompt, linear model, linear vocab, Pile bigram): Accuracy (Precision@1) using the transplanted hidden representation. The N = 0 case models immediate next-token prediction, and N ≥ 1 are the subsequent-token cases that are the focus of our work. The learned prompt is best able to recover future token information from the hidden state of a preceding individual token, with predictive accuracy peaking at middle layers and more than double the accuracy of a bigram baseline. A linear model predicting the hidden state fares comparably to one predicting directly into the output vocabulary.

Learned Prompt Causal Intervention: We then compare against trained prompts with the same token length as the fixed prompts. We train a soft prompt for each layer $l$ from 1 to 28. Each learned prompt is trained by maximizing the probability of generating the token from the prefix context at the penultimate layer, when the hidden state is transplanted at layer $l$ at the last token of the soft prompt, in the same way as the fixed prompts are applied. We train a prefix with a length of 10. This method performs best and is our main method.

3.4 Unveiling Subsequent Tokens

Figure 4 and Figure 5 illustrate the difference between our method and the baselines. The learned prompt optimized with the objective of predicting the next token (N = 1) has the best performance. On average, its precision@1 is 24.8% higher, precision@5 is 25.3% higher, and precision@10 is 25.1% higher than the best baseline method. The bigram baseline at N = 1 is shown as a horizontal line; the bigram model achieves 20.1% accuracy. (The bigram baseline is collected from 900,000 documents from the Pile dataset.) For surprisal, the learned prompt also has the lowest value, which indicates its efficacy at maximally unveiling the information behind the hidden states.
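To make the learned-prompt setup above concrete, here is a condensed sketch of how the per-layer soft prefix could be optimized against the KL objective of Eq. 10, shown for the single next-token case. The optimizer, learning rate, step count, and the hook-based transplant are illustrative choices of ours, not the authors' released training code; the model is assumed to be frozen (all parameters with requires_grad set to False).

```python
import torch

def train_soft_prefix(model, example_stream, layer, d_hidden=4096,
                      prefix_len=10, steps=1000, lr=1e-3):
    """Learned-prompt causal intervention (sketch): optimize a trainable prefix of
    embeddings so that, with h^l_T transplanted at the prefix's last position, the
    frozen model's next-token distribution matches GPT-J's original prediction."""
    prefix = torch.randn(1, prefix_len, d_hidden, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)

    for _ in range(steps):
        # (h^l_T, y_{T+1}) pairs harvested from original-context runs of GPT-J.
        h_lT, target_probs = next(example_stream)

        def hook(module, inputs, output):      # h^l at the last prefix slot := h^l_T
            hidden = output[0].clone()
            hidden[:, prefix_len - 1, :] = h_lT
            return (hidden,) + output[1:]

        handle = model.transformer.h[layer].register_forward_hook(hook)
        try:
            out = model(inputs_embeds=prefix)  # gradient reaches only `prefix`
        finally:
            handle.remove()
        log_q = torch.log_softmax(out.logits[:, -1, :], dim=-1)
        loss = torch.nn.functional.kl_div(log_q, target_probs,
                                          reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return prefix
```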
3.5 Contexts of Accurate Predictions

To further explore the contexts in which these methods are better (or worse) able to predict subsequent tokens, we categorize the input token (the last original context token) into eight (non-mutually exclusive) categories, shown in Table 3. We report the model accuracies when using layer 14, where the learned prompt model peaks. While all categories of token types are predicted better by the learned prompt than by the linear model, the relative improvement is highest when the last context token is a lowercase token preceded by a space, or a longer token. This suggests that information about how to complete long words may not be immediately accessible to a linear model decoder, but that it can be made accessible by using the parameters of the pretrained model, as the learned prompt intervention method does.

Figure 5 (panels N = 0 through N = 3; x-axis: in-layer; y-axis: surprisal; series: learned prompt, fixed prompt, linear model, linear vocab): Average surprisal of the model after transplantation. Again the learned prompt performs best, confirming the presence of subsequent-token information encoded at middle-layer hidden states.

| Lens | N = 0 | N = 1 | N = 2 | N = 3 |
| --- | --- | --- | --- | --- |
| Accuracy | | | | |
| LEARNED | 97.0 | 48.4 | 43.7 | 46.9 |
| FIXED | 97.0 | 20.8 | 30.0 | 36.5 |
| HS | 98.0 | 29.2 | 19.0 | 15.8 |
| VOCAB | 85.7 | 27.5 | 19.4 | 14.7 |
| Surprisal | | | | |
| LEARNED | 0.6 | 4.5 | 4.4 | 3.9 |
| FIXED | 0.6 | 8.8 | 6.5 | 5.7 |
| HS | 0.8 | 14.1 | 13.2 | 13.1 |
| VOCAB | 0.9 | 15.3 | 14.4 | 14.2 |

Table 2: Best accuracy and surprisal results for each method. LEARNED refers to the Learned Prompt Causal Intervention method; FIXED denotes the fixed version. HS is the linear model variant that predicts the hidden state; VOCAB is the linear model variant that predicts a distribution over the vocabulary directly.

We have also observed that the accuracy of predicting subsequent tokens correlates with the model's confidence in its next-token prediction. In the case of N = 1, for instance, the learned prompt intervention method's calibrated accuracy is 26%, 57%, 77%, and 95% for model confidence groups of 0-30%, 30-60%, 60-90%, and 90-100%, respectively. These trends appear for N = 2 and N = 3 as well. This suggests that we might gainfully use this decoding method as a probing tool, trusting that predicted future tokens are generally accurate when the model is confident.

Does future information appear only in the presence of higher-level concepts? For example, one might hypothesize that in cases where the language model predicts an entire named entity, the probing method might decode future predictions more accurately. To investigate this, we performed sub-group analyses on the test results to characterize how well the best probing method performed specifically for multi-token named entities. Interestingly, we found little difference: when examining just the named entity cases, we observe similar or slightly lower accuracy, 44%, 42%, and 37% for N = 1, 2, 3, suggesting that future information is present broadly, not only for long entity names.

In sum, we have found that a single hidden state encodes information about outputs more than one token ahead, and we have demonstrated three different methods that can decode it for GPT-J-6B.

Application: Future Lens. We apply Learned Prompt Intervention to create a novel probing tool we call the Future Lens. Given a soft prompt, we perform the intervention using the states arising from the user's prompt to provide a view into what the hidden states encode about future tokens. In Figure 6, we show an example for the prompt "Marty McFly from". The Future Lens reports the anticipated four tokens from every hidden state in the model (across layers).
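One way such a view could be assembled is sketched below: for every (layer, token) state of the user's prompt, transplant it through that layer's learned prefix, greedily decode a few tokens, and keep their average probability as the cell's shade. `transplant_decode` is a hypothetical helper combining the hook-based transplant shown earlier with greedy decoding under the soft prefix; the hidden-state indexing (entry l+1 of `hidden_states` is the output of block l) follows the Hugging Face convention as we understand it.

```python
import torch

def future_lens_grid(model, tokenizer, prompt, learned_prefixes, n_future=4):
    """Assemble the Future Lens grid (sketch): one cell per (layer, token) of the
    prompt, holding the greedily decoded future tokens and a confidence score."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    grid = {}
    for layer, prefix in enumerate(learned_prefixes):     # one soft prefix per layer
        for pos in range(ids.shape[1]):
            h = hidden_states[layer + 1][0, pos]          # h^l at this token
            # Hypothetical helper: transplant h into the soft prefix at `layer`
            # and greedily decode `n_future` tokens with their probabilities.
            tokens, probs = transplant_decode(model, tokenizer, prefix, h,
                                              layer, n_future)
            grid[(layer, pos)] = (tokens, sum(probs) / len(probs))
    return grid
```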
In the Future Lens visualization, every cell represents a hidden state from a particular layer ("L" plus the layer number) at a specific token. The shade of each cell indicates the model's average confidence with respect to the corresponding token predictions (darker shades indicate greater confidence). For example, at the cell representing the hidden state at layer 25 at the token "from", we can see that the confidence in the predicted tokens "Back to the Future" is strong. This particular state suggests that the LLM already knows that Marty McFly is related to the movie Back to the Future. Interestingly, the model also assumes "Marty" to have the surname Donough. Returning to the predictions at the token "from", we see that the early layers seem to first predict locations such as Australia or Boston. However, through the future predictions, we can see that the model begins to associate Marty McFly with a movie around layer 6. Hence, through this tool, we can gain further insights about the model's chain of predictions at every hidden state.

| Last context token type | Linear: vocab space | Linear: hidden state | Fixed context | Learned context |
| --- | --- | --- | --- | --- |
| Lowercase, no space | 21.7 | 25.2 | 9.2 | 32.5 |
| Lowercase, with space | 26.4 | 20.8 | 19.2 | 51.9 |
| Uppercase, no space | 29.2 | 26.3 | 0.0 | 23.3 |
| Uppercase, with space | 26.3 | 26.3 | 10.5 | 31.6 |
| Token length < 4 | 26.5 | 24.9 | 21.8 | 46.9 |
| Token length ≥ 4 | 23.9 | 24.4 | 18.0 | 52.1 |
| Punctuation | 28.7 | 28.7 | 16.6 | 47.8 |
| Numerical | 12.5 | 16.7 | 20.8 | 33.3 |

Table 3: Accuracy of predicting N = 1 token ahead ($y_{T+1}$, which predicts $x_{T+2}$) based on the hidden representation of the last context token ($x_T$). Results are shown for layer $l = 14$, where the learned prompt model is most accurate.

4 Related Work

Knowledge Prediction and Manipulation: Recent works have delved into LLM internals to better understand how such models predict the next token at each computation step. Geva et al. (2021), for instance, find that the feed-forward layers in transformers operate as key-value memories, allowing one to intervene at those layers to modify the next-token output (Geva et al., 2022). Frameworks such as ROME (Meng et al., 2022a) and MEMIT (Meng et al., 2022b) scale such manipulations to edit knowledge stored in LLMs.

The consensus that has emerged in these papers is that some early-middle and late layer calculations contribute the most to the final predicted token. Tools such as the Logit Lens (nostalgebraist, 2020) and Tuned Lens (Belrose et al., 2023; Din et al., 2023) allow us to look at the top-k values of the transformer at every layer and token to see early next-token predictions. Katz and Belinkov (2023) used the logit lens to visualize semantic information flow in GPT-2 models. While these works primarily deal with next-token predictions, Hernandez et al. (2023) show that specific attributes of an entity can be extracted with an affine transformation on the entity representation long before the LM is actually required to predict the attribute, enabling an "attribute lens" on early layers and early tokens. We aim to characterize how the current hidden state would affect the prediction of both the next token and tokens farther ahead, but unlike Hernandez et al. (2023), we deal with open contexts and are not constrained to certain relations.
Early Exit Decoding: To optimize the running time and space requirements of training models, prior work has looked at "early exit" strategies, which usually involve stopping at earlier layers of computation and estimating the final predictions based on those computations (Schuster et al., 2022; Xin et al., 2021; Kong et al., 2022; Zhang and He, 2020; Din et al., 2023). The takeaway from these methods is that it is possible to achieve prediction performance comparable to that observed when all layers are used, even when dropping a couple of computational layers for each token. For instance, Din and colleagues (2023) used linear transformations to predict a later layer's hidden representation from an earlier layer at the same token. This approach was able to preserve ~95% of the full transformer model outputs on GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018). This result implies that the initial model layers encode information that largely determines the final output. In this work we test the limits of this phenomenon by evaluating the degree to which a single hidden state for a token at position $T$ can be used to predict tokens multiple steps ahead (i.e., at $T+N$).

Memorization in Language Models: Due to the potentially sensitive information present in the datasets used to train language models (LMs), past work has investigated what, when, and why memorization occurs (Carlini et al., 2021, 2019; Feldman and Zhang, 2020; Lehman et al., 2021), how memorization changes as a function of training data size (Carlini et al., 2023; Wei et al., 2022), and how other memorized information can be detected based on model internal states (Haviv et al., 2023). These works have collectively illustrated that there are some text snippets that LMs remember and can output verbatim or in closely paraphrased versions ("approximate memorization"; Ippolito et al. 2023). Other work (Haviv et al., 2023) has shown that earlier layers of models tend to promote memorized concepts or tokens, while later layers boost model confidence in these tokens. Our paper can be viewed as an extension of this work on investigating memorization of multi-token phrases: we ask whether and to what extent a single model hidden state encodes multi-token information.

Figure 6 (grid of layers L1 through L28 over the tokens "Marty", "McFly", "from"): The Future Lens applied to the hidden states of GPT-J-6B processing "Marty McFly from". Each cell illustrates the most likely sequence of future tokens that the respective hidden state predicts. Darker boxes correspond to higher probabilities/confidence.

Prompt Tuning: Prompt tuning has emerged as a parameter-efficient method for fitting LMs to new downstream tasks. By freezing the LM and optimizing only the soft prompt parameters, models are able to achieve performance comparable to that observed after fine-tuning all parameters. Li and Liang (2021) introduced prefix tuning, which entails training a plug-and-play prefix that steers the behavior of the LM for the downstream task. Other work (Wallace et al., 2019) applied a gradient-based method to search for the best discrete prompts that enable the model to produce the desired generation. Sun and colleagues (2023) train a prefix soft prompt as a way of aligning semantically equivalent instructions in latent space.

5 Discussion

In this paper, we explored the degree to which we are able to decode multi-token outputs subsequent to a particular token on the basis of its hidden representation alone.
The results in Table 2 and Figures 4 and 5 indicate that the representations encode such information, to some degree. Among the decoding methods we assessed, learned prompts are best able to predict such future tokens. Both the linear and the learned prompt models achieve better accuracy than the empirical bigram baseline at N = 1. Interestingly, the predictive accuracy of the learned prompt model peaks at the middle-layer hidden states, suggesting that subsequent-token information is encoded at those layers; this pattern is very different from the immediate next-token case N = 0, in which accuracy peaks at the last layer.

The learned prompt model achieves an accuracy sufficient to be potentially useful as a "Future Lens" that provides insights about the subsequent-token information contained in hidden states within LLMs. This provides a way to decode a short sequence of tokens encoded in a hidden state, rather than only the single immediate token prediction.

Data and Code Availability: All code and data for the demo and implementation are available at: https://future.baulab.info

Acknowledgements: This work was supported by Open Philanthropy and by National Science Foundation (NSF) award 1901117. We thank the Center for AI Safety (CAIS) for making computing capacity available for this research.

References

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens.

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations.

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Conference on Security Symposium, SEC'19, pages 267–284, USA. USENIX Association.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models. In USENIX Security Symposium.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. 2023. Jump to conclusions: Short-cutting transformers with linear transformations.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling.

Mor Geva, Avi Caciularu, Guy Dar, Paul Roit, Shoval Sadde, Micah Shlain, Bar Tamir, and Yoav Goldberg. 2022. LM-Debugger: An interactive tool for inspection and intervention in transformer-based language models.
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 12–21, Abu Dhabi, UAE. Association for Computational Linguistics.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing.

Adi Haviv, Ido Cohen, Jacob Gidron, Roei Schuster, Yoav Goldberg, and Mor Geva. 2023. Understanding transformer memorization recall through idioms. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 248–264, Dubrovnik, Croatia. Association for Computational Linguistics.

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. 2023. Linearity of relation decoding in transformer language models.

Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, and Nicholas Carlini. 2023. Preventing verbatim memorization in language models gives a false sense of privacy.

Michael I. Jordan. 1997. Serial order: A parallel distributed processing approach. In Advances in Psychology, volume 121, pages 471–495. Elsevier.

Shahar Katz and Yonatan Belinkov. 2023. Interpreting transformer's attention dynamic memory and visualizing the semantic information flow of GPT.

Jun Kong, Jin Wang, Liang-Chih Yu, and Xuejie Zhang. 2022. Accelerating inference for pretrained language models by unified multi-perspective early exiting. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4677–4686, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, and Byron C. Wallace. 2021. Does BERT pretrained on clinical notes reveal sensitive data? arXiv preprint arXiv:2104.07762.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022a. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36.

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2022b. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.

nostalgebraist. 2020. interpreting GPT: the logit lens.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. 2022. Confident adaptive language modeling. In Advances in Neural Information Processing Systems.

Yixuan Su, Deng Cai, Yan Wang, David Vandyke, Simon Baker, Piji Li, and Nigel Collier. 2021. Non-autoregressive text generation with pre-trained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 234–243, Online. Association for Computational Linguistics.

Jiuding Sun, Chantal Shaib, and Byron C. Wallace.
2023. Evaluating the zero-shot robustness of instruction-tuned language models. arXiv preprint arXiv:2306.11270.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.

Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-Yan Liu. 2023. A survey on non-autoregressive generation for neural machine translation and beyond.

Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Online. Association for Computational Linguistics.

Minjia Zhang and Yuxiong He. 2020. Accelerating training of transformer-based language models with progressive layer dropping.

A Appendix

Additional Figures

In the main paper, we report results based on models that are trained to optimize the N = 1 single-token-ahead prediction, and we test those models for predictive accuracy at other values of N. The same methods can also be used to optimize for other subsequent tokens, and the results of those variants are shown here. We find that optimizing for N = 1 works best and generalizes surprisingly well to other N, but that optimizing for other N does not perform well for N = 1.
Figure 7 (panels N = 0 through N = 3; x-axis: in-layer; y-axis: Precision@1; series: learned prompt, fixed prompt, linear model, linear vocab, Pile bigram): Precision@1 (accuracy) of all methods trained to predict the currently decoded token (teacher forcing).

Figure 8 (same layout): Precision@1 (accuracy) of all methods trained to predict the 1st next token.

Figure 9 (same layout): Precision@1 (accuracy) of all methods trained to predict the 2nd next token.

Figure 10 (same layout): Precision@1 (accuracy) of all methods trained to predict the 3rd next token.

Limitations

In our exploration of extracting far-future tokens from single hidden states, we have mostly trained and tested on English data whose size, 100,000 tokens, is still relatively small compared to the amount of data that GPT-J-6B was actually trained on. Furthermore, the experiments were only conducted on GPT-J-6B. While the presence of subsequent-token information in a single hidden state is evident in this model, it would be more comprehensive to run these experiments on other LLMs. Since there are no specific prior works that focused on decoding far-future tokens from a single hidden state, we did not have prior baselines to refer to. While we did create a bigram baseline for the case of predicting 2 tokens into the future (N = 1) and also created linear models as a first decoding method, there could be baselines with other architectures, such as recurrent neural networks (Jordan, 1997; Elman, 1990) and non-autoregressive generation (Su et al., 2021; Xiao et al., 2023). Lastly, our experiments went up to 4 tokens into the future, i.e., N = 0, 1, 2, 3. It would be intriguing to scale up and test how many tokens into the future a single state actually encodes and predicts.