Learn how Cursor, the AI code editor, uses online reinforcement learning (RL) to train a brand-new Tab model. The new model reduces the number of suggestions shown by 21% while increasing the acceptance rate by 28%, giving developers a smoother, less intrusive coding experience.
For developers, the pursuit of ultimate productivity is an eternal quest. In the age of AI, a good code editor plays a crucial role. At Cursor, improving developer productivity is the core goal of their team, and a key player in this is Cursor Tab—an intelligent system that predicts a developer’s next move across the entire codebase.
Every time a user types a character or moves the cursor, Cursor Tab kicks in, trying to predict their intent. If the system is confident enough, it will display a suggestion in gray text, which the user can accept simply by pressing the Tab key.
This system handles over 400 million requests per day, which means Cursor has a massive amount of data on which suggestions are gladly accepted by users and which are ruthlessly ignored. This article will reveal how Cursor leverages this valuable data through “Online Reinforcement Learning” to make Cursor Tab smarter than ever before.
Cursor’s approach is somewhat unconventional. Most Large Language Model (LLM) providers train on static datasets or data from paid labelers, releasing so-called “major version updates” only every few months. Cursor is different: the team rolls out newly fine-tuned models to users daily and trains on real-time data, keeping the model in constant evolution.
Not Just Smarter, But Knowing When to Shut Up
Most developers have had this experience: an AI assistant that keeps popping up irrelevant suggestions, interrupting a smooth train of thought and causing frustration. This is the problem of “noisy suggestions.”
Maintaining a high “acceptance rate” for suggestions is crucial. If the acceptance rate is too low, it means the system is pushing too many incorrect suggestions, which is not only unhelpful but also disrupts the developer’s flow state.
To achieve a high acceptance rate, it’s not just about making the model smarter, but more importantly, teaching it when to remain silent. Sometimes, there isn’t enough contextual information in the code for even an AI with perfect knowledge and reasoning abilities to guess the user’s intention. In such cases, the best course of action is to suggest nothing at all.
Ditching the Old Ways: How Cursor Trains AI with “Policy Gradient”
The most intuitive way to filter out low-quality suggestions might be to train a separate classifier. GitHub Copilot, for example, has reportedly used a similar approach: a logistic regression model scores each suggestion based on 11 features, such as the programming language and whether the previous suggestion was accepted, and suggestions scoring below a threshold (say, 15%) are not shown.
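The filtering approach can be sketched as follows. Copilot's actual feature set and weights are not public, so the feature names, weights, and bias here are purely illustrative; only the overall shape (a logistic regression score compared against a threshold) matches the description above.

```python
import math

# Hypothetical feature weights for a suggestion-filter classifier.
# The real model reportedly uses 11 features; these three are made up.
WEIGHTS = {
    "prev_suggestion_accepted": 1.2,
    "is_python": 0.4,
    "chars_since_last_accept": -0.01,
}
BIAS = -1.0
THRESHOLD = 0.15  # hide the suggestion if predicted acceptance < 15%

def predicted_acceptance(features: dict) -> float:
    """Logistic-regression score: sigmoid of a weighted feature sum."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def should_show(features: dict) -> bool:
    return predicted_acceptance(features) >= THRESHOLD

# A context where the previous suggestion was accepted scores higher.
good = {"prev_suggestion_accepted": 1, "is_python": 1, "chars_since_last_accept": 10}
bad = {"prev_suggestion_accepted": 0, "is_python": 0, "chars_since_last_accept": 300}
```

The key property is that filtering happens *after* generation: the model still produces every suggestion, and a separate gate decides whether to display it.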
This method works. However, Cursor wanted a more fundamental and elegant solution: rather than filtering out bad suggestions after the model has generated them, the Tab model should avoid generating bad suggestions in the first place.
This is where the “Policy Gradient” method comes in.
Simply put, policy gradient is a method for optimizing a “Policy” with the goal of maximizing a “Reward.” In this context:
- Policy: The Tab model itself.
- Reward: A score assigned to each action taken by the model.
This algorithm operates like a feedback loop. It allows the model to randomly try different actions (showing or not showing a suggestion), observe which actions lead to high rewards (the user accepted the suggestion), and which lead to low rewards (the user rejected the suggestion). It then positively reinforces the behaviors that lead to high rewards while suppressing those that lead to low rewards.
To this end, Cursor designed a set of reward rules. For example, if the goal is for the model to only make suggestions when the expected acceptance rate is over 25%, the rewards could be set as follows:
- Suggestion accepted: Reward +0.75
- Suggestion rejected: Penalty -0.25
- No suggestion shown: Reward 0
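With these values, the break-even point falls exactly at a 25% acceptance probability: the expected reward of showing a suggestion accepted with probability p is 0.75·p − 0.25·(1 − p) = p − 0.25, which beats the reward of staying silent (0) only when p > 0.25. A quick check:

```python
def expected_reward(p_accept: float) -> float:
    """Expected reward of showing a suggestion accepted with probability p."""
    return 0.75 * p_accept - 0.25 * (1 - p_accept)  # simplifies to p - 0.25

assert expected_reward(0.10) < 0          # -0.15: staying silent is better
assert abs(expected_reward(0.25)) < 1e-9  # break-even point
assert expected_reward(0.40) > 0          # +0.15: showing is better
```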
Through this mechanism, the model learns to act only when it estimates the acceptance rate is high enough, in order to maximize its total reward. In effect, it builds an internal model of the acceptance rate; the rest is left to the optimization algorithm.
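The update rule behind this can be illustrated with a toy REINFORCE-style sketch. This is not Cursor's training code: the real policy is the Tab model itself, while here a single logit parameter controls the probability of showing a suggestion, and the simulated 10% acceptance rate is an arbitrary assumption chosen to sit below the 25% break-even.

```python
import math
import random

random.seed(0)

theta = 0.0  # toy policy parameter: logit of "show a suggestion"
LR = 0.5     # learning rate

def p_show(theta: float) -> float:
    """Probability the policy chooses to show a suggestion."""
    return 1.0 / (1.0 + math.exp(-theta))

def reward(shown: bool, accepted: bool) -> float:
    """Reward scheme from the article: +0.75 / -0.25 / 0."""
    if not shown:
        return 0.0
    return 0.75 if accepted else -0.25

# Simulated environment: suggestions are accepted only 10% of the time,
# below the 25% break-even, so the policy should learn to stay silent.
for _ in range(2000):
    p = p_show(theta)
    shown = random.random() < p
    accepted = shown and random.random() < 0.10
    r = reward(shown, accepted)
    # REINFORCE: gradient of the log-probability of the sampled action.
    grad_logp = (1 - p) if shown else -p
    theta += LR * r * grad_logp

print(p_show(theta))  # should have dropped well below the initial 0.5
```

Because rejected suggestions carry a negative reward while silence is neutral, the gradient steadily pushes the show-probability down in this low-acceptance regime, which is exactly the "learn when to shut up" behavior described above.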
An Ever-Evolving AI: Cursor’s Online Learning Loop
The policy gradient method has one prerequisite: the data used for training must come from the model that is being optimized. This means that once the model is updated, the data from the old model can no longer be used because it is already “outdated” information.
To obtain the freshest and most effective “on-policy” samples, Cursor must quickly deploy the new model to users, observe its performance, and then immediately feed this new interaction data into the next round of training.
This poses a significant challenge to Cursor’s infrastructure. The system needs a tight loop: the time from a user interacting with a suggestion to that data being used to train the next model must be as short as possible. Currently, it takes Cursor about 1.5 to 2 hours to roll out a new model checkpoint and collect the next round of data. That is already fast by AI industry standards, but Cursor believes there is still much room for improvement.
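The shape of that loop can be sketched schematically. The function names and the `Interaction` record are illustrative stand-ins for Cursor's real infrastructure; the one constraint the sketch enforces is the on-policy requirement, namely that checkpoint N is trained only on data produced by checkpoint N−1.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    context: str          # code context at suggestion time
    shown: bool
    accepted: bool
    policy_version: int   # which checkpoint produced this sample

def online_loop(deploy, collect, train, rounds: int) -> int:
    """Schematic on-policy loop: deploy -> collect -> train -> repeat.

    `deploy`, `collect`, and `train` are caller-supplied stand-ins;
    only the data-flow constraint is modeled here.
    """
    version = 0
    for _ in range(rounds):
        deploy(version)
        batch = collect(version)  # only fresh, on-policy samples
        assert all(i.policy_version == version for i in batch), \
            "stale (off-policy) data must not enter training"
        version = train(batch)    # produces the next checkpoint
    return version
```

Shrinking the wall-clock time of one iteration of this loop (currently 1.5 to 2 hours) directly increases how quickly the policy can improve.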
Less Interruption, Higher Efficiency: The Impressive Results of the New Tab Model
Using the methods described above, Cursor trained a brand-new Tab model, which is now the default in the Cursor editor. The results are clear from the data:
| Metric | Change vs. Old Model |
|---|---|
| Suggestions shown | -21% |
| Acceptance rate | +28% |
In short, the new Tab model has learned to make suggestions at more appropriate times, thereby reducing interruptions by 21%, while the quality of its suggestions is higher, leading to a 28% increase in the acceptance rate.
Cursor hopes that this improvement will significantly enhance the coding experience for developers and plans to continue to deepen the application of these methods in the future. This is not just a model update, but a testament to their team’s commitment to building the ultimate developer tool.


