Prompts, Rubrics and the Pedagogy of Machines

By Matthew Parish
Wednesday 11 March 2026
Large language models are frequently described as though they were mysterious black boxes, capable of producing essays, legal memoranda or software code without any clear explanation of how their apparent knowledge is organised. In reality their learning and refinement follow processes that resemble, in an abstract way, familiar educational methods used for human students. They are trained through vast quantities of text, but they are improved through structured exercises in which prompts test what they know and rubrics assess the quality of their responses. The combination of the two forms the core of modern large language model training and evaluation.
Understanding why this method is used requires an appreciation of how such models learn. A large language model is fundamentally a statistical system trained to predict the next word in a sequence of words. This is not the way humans construct sentences to convey meaning; large language models create text in a fundamentally different way from people, which is one reason this form of intelligence is described as "artificial". During the initial phase of training, sometimes called pre-training, the model is exposed to immense corpora of text drawn from books, articles, websites and other sources. By analysing patterns across billions or trillions of words, the model learns statistical relationships between linguistic units. It learns, for example, that the word “parliament” often appears near discussions of legislation, that “enzyme” tends to occur in biological contexts, and that the grammatical structure of English constrains which words plausibly follow others.
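The idea of learning successor statistics from a corpus can be illustrated with a deliberately tiny sketch. Real models use neural networks trained over billions of words rather than raw counts, but the objective — predict what plausibly comes next — is the same in spirit. The corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a small
# corpus, then predict the most frequently observed successor.
corpus = (
    "the parliament passed the legislation "
    "the parliament debated the legislation "
    "the enzyme catalysed the reaction"
).split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed successor of `word`, or None."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("enzyme"))  # the model has learned "enzyme" precedes "catalysed"
```

Even this trivial counter captures the point made above: "enzyme" occurs in a biological context, so the system predicts a biological continuation, without possessing any notion of biology.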
This stage produces a system that has absorbed an enormous amount of linguistic information, but it does not necessarily produce responses that are coherent, accurate or aligned with human expectations. The model has learned correlations rather than judgment. Consequently a second stage of development is required. This stage involves systematic testing of the model’s responses to structured questions or instructions. In this environment prompts function as examinations, while rubrics function as marking schemes.
A prompt, in the context of a large language model, is the instruction or question presented to the model in order to elicit a response. It may be simple or extremely elaborate. For example a prompt might request a short definition of a scientific concept, or it might instruct the model to draft a legal memorandum in a specified style, length and jurisdiction. Prompts are therefore the equivalent of examination questions. They define the task the model must perform and provide the contextual constraints within which it must reason.
A rubric, by contrast, defines how the resulting answer should be evaluated. In human education a rubric might specify that an essay will be assessed for accuracy, logical structure, originality and use of evidence. In the training of large language models, rubrics serve a similar function. They describe what constitutes a strong answer and provide criteria for comparing different outputs.
The interaction between prompts and rubrics is central to the improvement of the model. After a prompt has generated an answer, evaluators or automated systems compare the answer against the rubric. They examine whether the output contains factual errors, whether it follows the instructions contained in the prompt, whether it is logically coherent and whether it is stylistically appropriate for the task. The resulting evaluation is then used to adjust the model’s behaviour through various optimisation techniques.
One of the most widely used mechanisms for this process is reinforcement learning from human feedback. In this method multiple responses to a prompt may be generated. Human evaluators then rank the responses according to rubric criteria such as factual accuracy, helpfulness and clarity. The model is subsequently adjusted so that the responses ranked higher become statistically more probable in future iterations. In effect the system is rewarded for producing outputs that resemble those preferred by human evaluators.
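The ranking step is commonly turned into a training signal through a pairwise preference objective (a Bradley–Terry model): a reward model is trained so that its loss falls as it assigns a higher score to the response the evaluators preferred. The scores below are illustrative numbers, not real model outputs.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Negative log-probability that the chosen answer beats the rejected one,
    under a Bradley-Terry model of pairwise preferences."""
    p_chosen = 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
    return -math.log(p_chosen)

# The loss shrinks as the margin between the ranked answers grows:
print(preference_loss(2.0, 0.5))   # small margin between the two answers
print(preference_loss(4.0, -1.0))  # larger margin, lower loss
```

Minimising this loss over many ranked pairs is what makes the higher-ranked style of answer statistically more probable in the adjusted model.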
This process gradually shapes the behaviour of the model. The statistical patterns learned during pre-training remain the foundation of its knowledge, but the reinforcement process alters how that knowledge is deployed. The model becomes more likely to produce responses that are well structured, cautious about uncertain information and attentive to the instructions contained in the prompt.
There is a clear analogy here with professional education. A student studying law or medicine may read vast quantities of material during the course of their studies. Yet the decisive stages of learning occur when that knowledge is tested through structured questions and evaluated according to defined criteria. Examinations and marking schemes do not merely measure competence; they also guide students towards the kinds of reasoning and expression expected in the profession.
The same principle applies to large language models. Prompts guide the system towards particular forms of reasoning, while rubrics establish the standards against which that reasoning is measured.
Because of this pedagogical structure, the quality of prompts and rubrics becomes extremely important. Poorly designed prompts lead to ambiguous or misleading instructions. Weak rubrics fail to distinguish between strong and weak answers. In both cases the model receives unclear signals about what constitutes good performance.
A good prompt possesses several defining characteristics. The first is clarity of purpose. The prompt must specify precisely what task the model is being asked to perform. If the objective is vague, the resulting answer will inevitably be unfocused. For example asking a model simply to “discuss international law” invites a sprawling and incoherent response. By contrast a prompt that specifies the jurisdiction, the topic, the intended audience and the format of the answer creates a much clearer target.
The second characteristic of a good prompt is contextual framing. Large language models rely heavily on context when generating responses. Providing background information about the scenario allows the model to activate relevant patterns within its training data. A prompt requesting a legal memorandum may therefore specify that the audience is a board of directors, that the jurisdiction is the European Union and that the document should adopt the style of British legal drafting. Each of these contextual signals narrows the field of plausible responses.
A third characteristic is the inclusion of constraints. Constraints may involve length limits, stylistic requirements or particular analytical components that must be addressed. These constraints function as structural scaffolding for the answer. They encourage the model to organise its reasoning rather than producing a loosely connected narrative.
Good prompts also often contain explicit instructions about reasoning. For example they may request that the model set out arguments step by step or distinguish between legal principles and factual assumptions. These instructions guide the internal reasoning process of the model, encouraging outputs that are more methodical and transparent.
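The characteristics above — clarity of purpose, contextual framing, constraints and reasoning instructions — can be assembled programmatically. The field names and wording in this sketch are illustrative, not a standard prompt format.

```python
def build_prompt(task, audience, jurisdiction, max_words, step_by_step=True):
    """Assemble a structured prompt from the components discussed above."""
    parts = [
        f"Task: {task}",                        # clarity of purpose
        f"Audience: {audience}",                # contextual framing
        f"Jurisdiction: {jurisdiction}",        # contextual framing
        f"Length: at most {max_words} words",   # constraint
    ]
    if step_by_step:
        # explicit reasoning instruction
        parts.append("Set out your arguments step by step, and distinguish "
                     "legal principles from factual assumptions.")
    return "\n".join(parts)

print(build_prompt(
    task="Draft a legal memorandum on data-retention obligations",
    audience="a board of directors",
    jurisdiction="the European Union",
    max_words=800,
))
```

Each added field narrows the space of plausible responses, turning a sprawling request into the clear target described above.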
If prompts define the task, rubrics define the standards. A well-designed rubric must specify the criteria by which an answer will be judged. These criteria should be sufficiently detailed to distinguish different levels of quality. In an academic context such criteria might include factual accuracy, analytical depth, organisation of the argument and clarity of language. In the evaluation of language models similar categories are used.
An effective rubric also assigns relative weight to different criteria. Not all aspects of an answer are equally important. In a scientific explanation factual correctness may be paramount, whereas in a piece of persuasive writing rhetorical effectiveness may carry greater weight. Weighting ensures that the evaluation reflects the priorities of the task.
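Weighting can be made concrete with a small scoring sketch. The criteria, weights and per-criterion scores below are invented for illustration, not a standard evaluation scheme.

```python
# Weights reflecting the priorities of a scientific-explanation task,
# where factual correctness is paramount. Weights sum to 1.
RUBRIC = {
    "factual_accuracy": 0.5,
    "analytical_depth": 0.2,
    "organisation":     0.2,
    "clarity":          0.1,
}

def rubric_score(scores, rubric=RUBRIC):
    """Weighted average of per-criterion scores, each in the range 0-1."""
    return sum(rubric[c] * scores[c] for c in rubric)

answer_a = {"factual_accuracy": 0.9, "analytical_depth": 0.6,
            "organisation": 0.8, "clarity": 0.7}
answer_b = {"factual_accuracy": 0.5, "analytical_depth": 0.9,
            "organisation": 0.9, "clarity": 0.9}

# Weighting lets the factually stronger answer win despite weaker style.
print(rubric_score(answer_a), rubric_score(answer_b))
```

Under an unweighted average the stylistically polished answer would score higher; the weights encode the judgment that, for this task, accuracy matters most.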
Another important feature of a good rubric is the presence of concrete examples. Abstract descriptions of quality can be difficult to apply consistently. By providing examples of what constitutes a strong or weak answer, the rubric helps evaluators apply the criteria in a more consistent manner. Consistency is essential when the resulting evaluations will influence the training of a machine learning system.
Rubrics also encourage transparency in evaluation. When evaluators are required to justify their judgments according to specified criteria, the resulting feedback becomes more structured and informative. This feedback can then be translated into training signals that guide the model’s future behaviour.
Over time the repeated application of prompts and rubrics produces cumulative improvements in the model’s performance. Each evaluation cycle refines the model’s ability to interpret instructions, organise information and produce coherent explanations. The process resembles a continuous examination in which the student gradually internalises the expectations of the examiners.
It is important however to recognise that this pedagogical structure does not confer genuine understanding in the human sense. A large language model does not possess beliefs, intentions or comprehension. It operates by adjusting probabilities within a vast network of parameters. Yet the structured interaction between prompts and rubrics allows these probabilistic systems to approximate forms of reasoning that appear purposeful and disciplined.
For users of large language models, the implications are significant. The quality of the system’s output depends heavily upon the quality of the instructions it receives. A carefully designed prompt can transform a vague or superficial answer into a detailed and well organised analysis. Conversely an imprecise prompt may produce an answer that appears confident but fails to address the user’s true needs.
Similarly the development of rigorous rubrics is essential for organisations seeking to evaluate or fine-tune language models for specialised tasks. Whether the objective is legal drafting, medical explanation or policy analysis, the rubric determines which qualities of the output will be reinforced during training.
The emerging discipline of prompt and rubric design therefore resembles a new form of pedagogy. Engineers and evaluators must think like teachers, constructing questions that reveal knowledge and marking schemes that reward the forms of reasoning they wish the model to emulate. There is no single way of doing this correctly, just as there is no single "best" way for a teacher to educate a student. It is an art as much as a science.
In the years ahead this pedagogical framework is likely to become increasingly sophisticated. Prompts will evolve into complex scenarios designed to test subtle forms of reasoning. Rubrics will incorporate detailed metrics for factual reliability, logical coherence and ethical compliance. The goal will remain the same: to guide statistical systems towards outputs that are not merely plausible, but genuinely useful.
The paradox of large language models is that machines built upon probability and pattern recognition must ultimately be educated through methods that resemble the oldest traditions of human learning. Questions are posed, answers are examined and standards are applied. Through this process of continuous examination the model gradually learns not what is true, but what kinds of responses humans recognise as thoughtful, careful and reliable.
---
The author is a trainer of large language models in the fields of legal analysis and advice, and has written several articles on artificial intelligence, large language models and the challenges of training them to replicate human content in contexts where there is inherent ambiguity in human thought processes. He can be contacted at parish.matthew@gmail.com.

