Every time you send a prompt to an LLM, a set of hidden controls shapes the response. These settings determine whether the output sounds creative or predictable, long or short, repetitive or varied. Most people never touch them, and that is often fine.
But knowing what these controls do gives you a real advantage. You can make outputs more reliable for factual work or more inventive for brainstorming.
The gap between a mediocre result and a great one often comes down to how you use the model, and that includes the settings behind the scenes. Five settings appear across most LLM providers: temperature, top-p, max tokens, frequency penalty, and presence penalty.
Most users never need to change them. Default values handle everyday tasks well.
But when you need more precision, even small adjustments can produce noticeably better results. That applies to coding pipelines, creative writing projects, and customer service bots alike.
This article explains each setting in plain language, shows when to adjust them, and provides recommended values for common tasks.
What LLM Settings Actually Control
When a large language model generates text, it predicts one word (or token) at a time. At each step, the model calculates a probability for every possible next word. Settings like temperature and top-p shape how the model picks from those probabilities.
LLM settings (inference parameters): Controls that adjust how a model selects its next word during text generation. They do not change the model itself, only how it samples from its predictions.
Think of it like a music playlist on shuffle. Temperature controls how adventurous the shuffle is. A low temperature plays your top hits on repeat.
A high temperature pulls from deep cuts and obscure tracks. Top-p sets how big the playlist is in the first place.
Most of these settings are available in ChatGPT (through the API and custom GPTs), Claude, Gemini, and virtually every other LLM, although not every provider exposes all five. Anthropic’s API, for example, has no frequency or presence penalty. The exact defaults vary by provider, but the concepts work the same way everywhere. Understanding them gives you control over output quality without changing a single word in your prompt.
How Each Setting Works in Practice
Temperature
Temperature is the most commonly adjusted setting. It controls the randomness of the model’s word choices. You will find it in every major LLM API, and OpenAI’s API reference documents it as a value between 0 and 2.
At temperature 0, the model always picks the most probable next word. The same prompt will produce nearly identical outputs every time.
At temperature 1, the model samples more freely across all probable words. This produces varied and sometimes surprising results. Values above 1 (up to 2 on OpenAI’s API; some providers cap temperature at 1) push outputs toward increasingly unpredictable territory.
Here is what different ranges feel like in practice:
- 0 to 0.3: Highly focused and deterministic. The model sticks to the most likely phrasing. Good for factual answers, data extraction, and classification tasks.
- 0.4 to 0.7: Balanced. Enough variety to feel natural without going off-topic. Works well for general writing and summarization.
- 0.8 to 1.2: Creative and exploratory. The model takes more risks with word choice. Suited for brainstorming, fiction, and creative copywriting.
- 1.3 to 2.0: Highly random. Outputs can become incoherent at the upper end. Rarely useful outside of experimental contexts.
For most users, temperature is the single most impactful setting to learn. If you only adjust one thing, this is it.
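To make the ranges above concrete, here is a minimal sketch of how temperature is commonly applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The token list and scores are made up for illustration, and treating temperature 0 as "always pick the top word" is an assumption about how providers handle that edge case, not a description of any specific API.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding,
        # so all probability goes to the highest-scoring word.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Pick one token at random, weighted by the temperature-adjusted probabilities."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for four candidate next words.
tokens = ["the", "a", "this", "zebra"]
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0, 0.3, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the pattern described above: as temperature rises, probability shifts away from the top word and toward the unlikely ones.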
Top-P (Nucleus Sampling)
Top-p works differently from temperature but targets a similar outcome. Instead of reshaping the whole probability distribution, top-p cuts off the tail end of unlikely words.
A top-p of 0.9 means the model only considers the smallest set of most likely words whose probabilities add up to at least 90%. The least-likely words that make up the remainder are excluded entirely. A top-p of 0.1 restricts the model to only the very top candidates.
The effect is subtle but meaningful. Where temperature stretches the entire distribution, top-p trims it. A low top-p eliminates surprising word choices completely, while a low temperature just makes them less likely.
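The trimming can be sketched in a few lines. This is an illustrative version of nucleus sampling's filtering step, not any provider's actual implementation: it ranks words by probability, keeps the smallest set that reaches the top-p threshold, and renormalizes what is left. The word probabilities are invented for the example.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then rescale the survivors so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # everything after this point is the excluded tail
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-word probabilities.
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print(top_p_filter(probs, 0.9))  # the unlikely tail is dropped
print(top_p_filter(probs, 0.1))  # only the single top candidate survives
```

Note that excluded words end up with zero probability, which is the difference from a low temperature, where they remain possible but unlikely.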
Most providers set top-p to 1.0 by default, which means no words are excluded at all. Adjusting top-p below 1.0 is useful when you want focused outputs without making the temperature so low that text sounds robotic. For tasks like code generation or structured data extraction, a top-p of 0.1 to 0.3 keeps outputs tight and predictable.
Adjust either temperature or top-p, not both at the same time. Changing both creates unpredictable interactions. Start with temperature for broad control, and switch to top-p only if you need finer pruning of unlikely outputs.
Max Tokens (Maximum Length)
Max tokens sets the upper limit on how long the model’s response can be. It does not force the model to write that many tokens; it only prevents the response from exceeding that number.
This matters for cost control and practical formatting. Each token you generate through an API costs money, and LLM costs are typically calculated per token. Setting a max token limit prevents runaway responses that burn through your budget.
A few things to keep in mind. Max tokens limits only the output, not the input. The total of input plus output cannot exceed the model’s context window.
For reference, GPT-5 supports a context window of up to 400,000 tokens, while Claude Opus 4.6 and Gemini 2.5 Pro each support up to 1,000,000 tokens.
If you set max tokens too low, responses get cut off mid-sentence. If you leave it at the maximum, you may pay for more output than you need. Setting max tokens to roughly 1.5 times your expected response length is a practical starting point.
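As a rough sketch of that rule of thumb, the helper below multiplies the expected response length by 1.5 and caps the result at whatever room is left in the context window. The function names, the token counts, and the price are all hypothetical; check your provider's current pricing and limits before relying on numbers like these.

```python
import math

def suggest_max_tokens(expected_output_tokens, input_tokens, context_window, headroom=1.5):
    """Suggest a max-tokens value: headroom times the expected length,
    capped by the space remaining after the input."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("The input already fills the context window")
    suggested = math.ceil(expected_output_tokens * headroom)
    return min(suggested, remaining)

def worst_case_output_cost(max_tokens, price_per_million_output_tokens):
    """Upper bound on output cost for one request if the model uses the full limit."""
    return max_tokens * price_per_million_output_tokens / 1_000_000

# Hypothetical request: 1,200 input tokens, about 400 tokens expected back,
# and an assumed 8,000-token context window.
limit = suggest_max_tokens(expected_output_tokens=400, input_tokens=1200, context_window=8000)
print(limit)

# Assumed price of 10.00 per million output tokens, for illustration only.
print(worst_case_output_cost(limit, price_per_million_output_tokens=10.0))
```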
Frequency Penalty
Frequency penalty reduces how often the model repeats the same words or phrases. It applies a penalty each time a word appears, and the penalty grows with each additional use.
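A value of 0 means no penalty. Positive values up to 2.0 apply increasingly strong discouragement against repetition. (OpenAI’s API also accepts negative values down to -2.0, which encourage repetition instead.)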
This setting is most useful for longer outputs where models tend to fall into loops. If you notice the model restating the same point in different words, a moderate frequency penalty (0.3 to 0.8) can break the pattern. This is especially common with outputs longer than 1,000 tokens.
Be cautious, because frequency penalty values above 1.0 often make text sound unnatural. The model starts avoiding common words that naturally repeat in normal writing, like “the” or “is.”
Presence Penalty
Presence penalty is similar to frequency penalty but works as a flat, one-time nudge. Once a word appears in the output, it receives a fixed penalty regardless of how many times it has been used. This encourages the model to introduce new topics and vocabulary rather than circling back.
The practical difference is straightforward. Frequency penalty fights word-level repetition. Presence penalty fights topic-level repetition.
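The difference between the two penalties is easiest to see side by side. The sketch below is modeled on the adjustment OpenAI has described for its API, where a word's score is reduced by its count times the frequency penalty, plus a flat presence penalty if it has appeared at all. Other providers may compute this differently, and the words and scores here are invented.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appeared in the output."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        frequency_term = count * frequency_penalty               # grows with every repeat
        presence_term = presence_penalty if count > 0 else 0.0   # flat, applied once
        adjusted[token] = score - frequency_term - presence_term
    return adjusted

# Hypothetical scores, with "solar" already used three times and "wind" once.
logits = {"solar": 2.0, "wind": 1.8, "hydro": 1.5}
history = ["solar", "solar", "solar", "wind"]

print(apply_penalties(logits, history, frequency_penalty=0.5))
print(apply_penalties(logits, history, presence_penalty=0.5))
```

With the frequency penalty, "solar" is hit three times as hard as "wind". With the presence penalty, both lose the same fixed amount, and only the unused "hydro" is untouched.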
A moderate presence penalty (0.3 to 0.6) encourages the model to explore new ground. This is helpful for brainstorming and open-ended generation.
In practice, presence penalty matters most when you ask a model to generate lists or ideas. Without it, models tend to cluster around similar concepts. A small presence penalty pushes the model toward more diverse suggestions.
Most users can leave both penalties at 0 for everyday tasks. They become valuable primarily in long-form generation, or when you build prompt engineering techniques into pipelines that handle extended outputs.
Settings at a Glance
The table below summarizes each setting with recommended values for common use cases.
| Setting | Range | Default (Typical) | Factual/Data Tasks | General Writing | Creative Work |
|---|---|---|---|---|---|
| Temperature | 0 – 2.0 | 1.0 | 0 – 0.2 | 0.5 – 0.7 | 0.8 – 1.2 |
| Top-P | 0 – 1.0 | 1.0 | 0.1 – 0.3 | 0.7 – 0.9 | 0.9 – 1.0 |
| Max Tokens | 1 – model limit | Varies | 256 – 512 | 1,024 – 2,048 | 2,048 – 4,096 |
| Frequency Penalty | 0 – 2.0 | 0 | 0 | 0.2 – 0.5 | 0.5 – 0.8 |
| Presence Penalty | 0 – 2.0 | 0 | 0 | 0.1 – 0.3 | 0.3 – 0.6 |
These are starting points, not rules. The right settings depend on your specific task, the model you are using, and the quality of your prompt. A well-crafted prompt with clear instructions often matters more than any setting adjustment.
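One way to put starting points like these to work is to store them as named presets. The sketch below picks one value from inside each range in the table and deliberately sets temperature only, leaving top-p at the provider default. The preset names, the `build_request` helper, and the shape of the returned dictionary are all hypothetical; adapt the keys to whatever your provider's API expects.

```python
# Starting values chosen from within the table's ranges (illustrative only).
PRESETS = {
    "factual":  {"temperature": 0.1, "max_tokens": 512,  "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "general":  {"temperature": 0.6, "max_tokens": 1024, "frequency_penalty": 0.3, "presence_penalty": 0.2},
    "creative": {"temperature": 1.0, "max_tokens": 2048, "frequency_penalty": 0.6, "presence_penalty": 0.4},
}

def build_request(prompt, task="general", **overrides):
    """Combine a preset with per-call overrides into one settings dictionary."""
    if task not in PRESETS:
        raise ValueError(f"Unknown task: {task!r}")
    if "temperature" in overrides and "top_p" in overrides:
        # Mirrors the advice to adjust one or the other, not both.
        raise ValueError("Override temperature or top_p, not both")
    settings = dict(PRESETS[task])
    if "top_p" in overrides:
        # If the caller chooses top_p, drop the preset temperature
        # so only one sampling control is set explicitly.
        settings.pop("temperature")
    settings.update(overrides)
    return {"prompt": prompt, **settings}

print(build_request("Extract the invoice total.", task="factual"))
print(build_request("Pitch ten podcast names.", task="creative", top_p=0.9))
```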
When Settings Work for You and When They Do Not
Strengths
Temperature and top-p give you meaningful control over output style without changing your prompt. This is useful when you want the same prompt to produce different types of results.
A customer service template at temperature 0.2 stays on-script. The same template at 0.8 sounds warmer and more conversational.
Max tokens provides cost predictability. For API users running thousands of requests, capping output length keeps bills from spiraling. Combined with smart prompt design, it lets you build reliable automated workflows.
Frequency and presence penalties solve a real problem with long-form generation. Models naturally drift toward repetition over long outputs, and these settings counteract that tendency without requiring prompt changes. For writers generating articles or reports, even a small frequency penalty of 0.3 can noticeably reduce circular phrasing.
Limitations
Settings cannot fix a bad prompt. If your instructions are vague, lowering the temperature just makes the model confidently produce the wrong thing. The foundation is always what you write in the prompt itself.
In practice, temperature above 1.2 rarely produces usable results for any real task. The outputs become incoherent quickly. The added randomness does not translate into genuine creativity, which comes from better prompts, not higher temperature.
These settings also do not reduce LLM hallucinations. A model at temperature 0 will still state incorrect facts confidently.
This happens when the answer is not well-represented in the training data. Low temperature makes outputs consistent, not more accurate.
Settings behave slightly differently across providers. Anthropic’s documentation notes that Claude defaults to temperature 1 and accepts values only from 0 to 1, while OpenAI accepts values up to 2.
Both providers recommend adjusting either temperature or top-p, not both. These differences are small but worth noting when switching between models.
Low temperature does not mean higher accuracy. A model at temperature 0 still generates text based on statistical patterns, not verified facts. Always validate important claims regardless of the temperature setting used.
Common Misunderstandings About LLM Settings
“Temperature 0 gives the correct answer.”
Temperature 0 gives the most probable answer, which is not the same thing. The model’s training data may contain errors or biases that surface in the most likely prediction. Temperature controls consistency, not truthfulness.
“Higher temperature means more creative.”
Only up to a point. Beyond about 1.2, outputs become random rather than creative. Genuine creativity comes from providing the model with rich context and interesting constraints, not from cranking up randomness.
“You should always adjust top-p and temperature together.”
In practice, adjusting both simultaneously often creates unpredictable results. Most model providers recommend choosing one or the other for any given task, depending on how much control you need.
“Max tokens controls quality.”
Max tokens only controls length. A 500-token response is not inherently better or worse than a 2,000-token response.
Quality depends on the prompt, the model, and how well the task fits the model’s strengths. Comparing different models for specific tasks often matters more than adjusting output length.
“These settings work the same across all models.”
While the concepts are consistent, each provider implements them slightly differently. A temperature of 0.7 on GPT-5 may feel different from 0.7 on Claude Sonnet 4.5. Testing with your specific model is the only way to calibrate.
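A simple way to calibrate is to send the same prompt several times at each temperature and count how many distinct responses come back. The sketch below shows the idea with a stand-in `fake_generate` function that returns canned answers; it is a placeholder, not a real client, and you would swap in a call to your own model.

```python
import random

def distinct_ratio(generate, prompt, temperature, runs=10):
    """Share of runs that produced a unique response (1/runs = fully repeatable)."""
    outputs = [generate(prompt, temperature) for _ in range(runs)]
    return len(set(outputs)) / runs

def fake_generate(prompt, temperature, rng=random.Random(0)):
    """Placeholder for a real model call. It always gives the same answer at
    temperature 0 and picks among a few canned answers otherwise."""
    options = ["Paris.", "Paris, France.", "It is Paris.", "The capital is Paris."]
    if temperature == 0:
        return options[0]
    return rng.choice(options)

for t in (0, 1.0):
    print(t, distinct_ratio(fake_generate, "What is the capital of France?", t))
```

Running the same check against two different models at the same temperature gives you a rough, comparable measure of how much each one varies.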
Conclusion
LLM settings give you a second layer of control beyond your prompt. Temperature is the one most worth learning first. It directly shapes whether outputs feel predictable or exploratory.
Top-p, frequency penalty, and presence penalty offer finer adjustments for specific use cases, but they are not essential for everyday use.
The most important takeaway is that settings are secondary to prompt quality. A well-written prompt at default settings will almost always outperform a vague prompt with carefully tuned parameters. Start with strong prompting fundamentals, then reach for these controls when you need more precision.
To see how these settings play out across different models, a natural next step is to compare which LLM fits your specific use case.