
Building LLMs for Code Repair

By Vishalini Devarajan

Code breaks. It breaks during development when requirements change, during maintenance when dependencies update, and in production when edge cases appear that no one anticipated. Fixing broken code is one of the most time-consuming parts of software development and one of the least satisfying.

LLMs for code repair change this. Instead of a developer manually reading through stack traces, tracing execution paths, and rewriting broken logic, a model trained specifically for repair tasks can identify the fault, understand the context, and generate a corrected version of the code automatically.

Building these models is not the same as building general-purpose code generation models. Code repair requires a different kind of understanding: the model must reason about what the code was supposed to do, what it actually does, and what change would bring those two things into alignment.

In this article, let us understand what LLMs for code repair are, how they are built, what makes them different from general coding models, and how to design training and evaluation pipelines that produce models capable of genuine repair rather than plausible-looking rewrites.

Table of contents


  1. TL;DR
  2. What Are LLMs for Code Repair?
  3. How Code Repair Works Today
  4. The Real Problem: Repair Is Not Generation
  5. The Shift: Training for Repair Rather Than Generation
  6. How to Build an LLM for Code Repair
    • Component 1: Dataset Construction
    • Component 2: Model Selection and Fine-Tuning
    • Component 3: Inference Design
    • Component 4: Execution-Based Evaluation
  7. Why Repair-Specific Training Is More Effective
  8. Fault Context as a Crucial Enabler
  9. Conclusion
  10. FAQs
    • What are LLMs for code repair?
    • How is code repair different from code generation?
    • What data is needed to train a repair model?
    • How should repair models be evaluated?
    • Is it better to fine-tune a model or prompt a general model?
    • Can repair models fully replace human code review?

TL;DR

1. LLMs for code repair are trained to fix broken code rather than generate new code from scratch, which requires a different training approach and evaluation methodology.

2. Effective repair models are trained on paired datasets of buggy and fixed code alongside fault context such as error messages, test failures, and stack traces.

3. The repair task requires models to reason about program semantics and intended behavior, not just pattern-match to a correct-looking output.

4. Evaluation of repair models must use execution-based metrics that verify the fix actually works, not surface-level similarity to a reference patch.

5. Fine-tuning a general-purpose code LLM on repair-specific data consistently outperforms prompting the same model without repair-focused training.

What Are LLMs for Code Repair?

LLMs for code repair are language models trained specifically to identify faults in existing code and generate corrected versions. Unlike general-purpose code generation models that produce new code from descriptions, repair models take broken code as input and output fixed, working code.

How Code Repair Works Today

Current approaches to automated code repair fall into two broad categories. The first is static analysis combined with predefined fix templates. A tool identifies a pattern that matches a known bug class and applies the corresponding fix. This works reliably for a narrow set of fault types but fails completely on anything outside its template library.

The second is using general-purpose LLMs with prompting. A developer pastes broken code and an error message into a chat interface and asks the model to fix it. The model produces a plausible response but has no guarantee of correctness because it was not trained specifically for repair tasks and has no mechanism to verify that its output actually resolves the fault.

Both approaches produce unreliable results for the same underlying reason: neither reasons specifically about the relationship between faulty code, its intended behavior, and the minimal change that restores correct behavior. LLMs built specifically for code repair address this gap directly.

The Real Problem: Repair Is Not Generation

The fundamental challenge in building LLMs for code repair is that repair and generation are different cognitive tasks. Generation starts from a specification and produces code. Repair starts from existing code, identifies a fault, and produces a targeted modification. A model trained primarily on generation learns to produce code that looks correct. A model trained for repair learns to distinguish correct from incorrect and to make the specific change that moves code from one state to the other.

When a general-purpose model is asked to repair code it was not trained to repair, it tends to rewrite rather than repair. It produces a new version of the code that may be functionally equivalent, may be better structured, or may introduce new bugs while appearing to fix the original one. The output looks like a repair but is not verified as one.

This distinction matters because repair in production contexts requires precision. A change that fixes one bug while introducing another is not a repair. It is a replacement that shifts the problem rather than solving it.


The Shift: Training for Repair Rather Than Generation

Building LLMs specifically for code repair means constructing training data, model architecture choices, and evaluation pipelines around the repair task rather than adapting them from generation workflows.

The training data for a repair model is structured differently from generation training data. Each example pairs a faulty version of code with a corrected version, and ideally includes the fault context: the error message, the failing test, the stack trace, or a natural-language description of the observed behavior. The model learns to map from this rich fault context to a targeted fix.

This shift in training data structure is what produces a model that genuinely repairs rather than one that generates plausible-looking alternatives.

How to Build an LLM for Code Repair

Building an LLM for code repair involves four interconnected components: dataset construction, model selection and fine-tuning, inference design, and execution-based evaluation. Each component must be designed with the repair task in mind rather than adapted from a generation pipeline.

Component 1: Dataset Construction

The training dataset is the most important factor in repair model quality. A high-quality repair dataset contains paired examples of faulty and fixed code, with the fault context included as part of the input. Sources for this data include open-source version control history, bug tracking systems, competitive programming submissions with their corrections, and automated fault injection into correct code.

// Training example structure
{
  "fault_context": {
    "error_message": "TypeError: Cannot read property 'id' of undefined",
    "stack_trace": "at getUserData (user.js:14)",
    "failing_test": "getUserData should return null for missing user"
  },
  "buggy_code": "function getUserData(userId) {\n  const user = users.find(u => u.id === userId);\n  return user.name;\n}",
  "fixed_code": "function getUserData(userId) {\n  const user = users.find(u => u.id === userId);\n  return user ? user.name : null;\n}"
}

The quality of fault context significantly affects repair accuracy. Models trained with rich fault context, including error messages and failing tests alongside the buggy code, consistently outperform models trained on buggy and fixed code pairs alone.
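
The sources listed above can be mined largely automatically. As an illustration, the sketch below walks a repository's commit history, treats commits whose messages mention a fix as repair events, and records the file version before and after each one. It assumes GitPython is available; the keyword filter and output schema are illustrative choices, and a production pipeline would add deduplication, language filtering, and extraction of the associated failing tests.

# Minimal sketch: mine buggy/fixed file pairs from bug-fix commits.
# Assumes GitPython is installed; keyword filter and schema are illustrative.
from git import Repo

BUG_KEYWORDS = ("fix", "bug", "repair", "patch")

def mine_repair_pairs(repo_path, max_commits=1000):
    repo = Repo(repo_path)
    pairs = []
    for commit in repo.iter_commits(max_count=max_commits):
        if not any(k in commit.message.lower() for k in BUG_KEYWORDS):
            continue
        if not commit.parents:
            continue  # skip root commits, which have no "before" version
        parent = commit.parents[0]
        for diff in parent.diff(commit):
            if diff.change_type != "M" or not diff.a_path.endswith(".py"):
                continue  # keep only modified Python files in this sketch
            buggy = repo.git.show(f"{parent.hexsha}:{diff.a_path}")
            fixed = repo.git.show(f"{commit.hexsha}:{diff.b_path}")
            pairs.append({
                "fault_context": {"commit_message": commit.message.strip()},
                "buggy_code": buggy,
                "fixed_code": fixed,
            })
    return pairs

# Usage: pairs = mine_repair_pairs("./some-repo")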

Component 2: Model Selection and Fine-Tuning

The base model for a repair LLM should be a code-specialized foundation model rather than a general language model. Models pre-trained on large code corpora already understand programming language syntax, common patterns, and the relationship between code structure and behavior. Fine-tuning from this starting point requires significantly less repair-specific data than training from a general base.

Fine-tuning on repair data should use the fault context as the model input and the corrected code as the target output. The loss function should weight the changed tokens more heavily than the unchanged ones, encouraging the model to learn what to modify rather than to reproduce the entire file.

# Fine-tuning configuration for the repair task
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./repair-model',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy='steps',
    eval_steps=500
)

# Format input with fault context
def format_repair_input(example):
    return {
        'input': f"Fix this bug:\nError: {example['error_message']}\n\nCode:\n{example['buggy_code']}",
        'target': example['fixed_code']
    }
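
The standard Trainer loss treats every target token equally, so the heavier weighting of changed tokens described above has to be added by hand. The sketch below shows one way to do it, assuming a per-example mask marking which target tokens differ between the buggy and fixed code; the weight value of 3.0 and the mask construction are illustrative, not prescribed.

import torch.nn.functional as F

def weighted_repair_loss(logits, labels, changed_mask, changed_weight=3.0):
    # logits: (batch, seq, vocab); labels: (batch, seq) with -100 marking
    # positions to ignore; changed_mask: (batch, seq) with 1.0 where the
    # fixed code differs from the buggy code. Weight value is illustrative.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
        ignore_index=-100,
    ).reshape(labels.size())
    weights = 1.0 + (changed_weight - 1.0) * changed_mask
    valid = (labels != -100).float()
    return (per_token * weights * valid).sum() / valid.sum().clamp(min=1.0)

In a Hugging Face Trainer setup, a loss like this would typically be wired in by subclassing Trainer and overriding compute_loss so the changed-token mask travels with each batch.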

Component 3: Inference Design

At inference time, the model receives the fault context and the buggy code and generates a repair candidate. Because repair correctness cannot be determined from the model output alone, the inference pipeline should generate multiple candidates and rank them by likelihood before passing them to execution-based verification.

Sampling strategies that produce diverse candidates, such as nucleus sampling with a moderate temperature, outperform greedy decoding for repair tasks because the correct fix is not always the most likely token sequence from the model’s perspective. Diversity in the candidate set increases the probability that at least one candidate passes verification.

# Generate multiple repair candidates
candidates = model.generate(
    inputs,
    max_new_tokens=512,
    num_return_sequences=5,
    do_sample=True,
    top_p=0.95,
    temperature=0.8
)

# Rank candidates by model confidence
# (model.score is assumed to return a log-likelihood for each candidate)
scored = [(model.score(c), c) for c in candidates]
ranked = sorted(scored, key=lambda s: s[0], reverse=True)

Component 4: Execution-Based Evaluation

The only reliable way to evaluate a repair model is to run the repaired code and check whether it behaves correctly. Surface-level metrics such as BLEU score or edit distance to a reference patch measure similarity to a known fix but do not verify that the generated fix actually works. A repair that is syntactically different from the reference patch but passes all tests is a valid repair. A repair that matches the reference patch but fails a test that the reference patch was supposed to fix is not.

def evaluate_repair(original_code, repaired_code, test_suite):
    # Check that the repair compiles
    try:
        compile(repaired_code, '<string>', 'exec')
    except SyntaxError:
        return {'valid': False, 'reason': 'syntax_error'}

    # Run the test suite against the repaired code
    results = run_tests(repaired_code, test_suite)

    return {
        'valid': results.all_passed,
        'tests_passed': results.passed,
        'tests_failed': results.failed,
        'regression': results.regressions
    }
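
Putting the inference and evaluation components together, the pipeline can walk the ranked candidates and accept the first one that passes execution-based verification, falling back to human review when none do. A minimal sketch, reusing evaluate_repair from above and assuming a decode_candidate helper that turns a generated sequence back into source text:

# Minimal sketch: accept the first ranked candidate that passes the test suite.
# decode_candidate() is an assumed helper for turning model output into source.
def select_verified_repair(ranked_candidates, original_code, test_suite):
    for score, candidate in ranked_candidates:
        repaired_code = decode_candidate(candidate)
        result = evaluate_repair(original_code, repaired_code, test_suite)
        if result['valid']:
            return {'code': repaired_code, 'score': score, 'result': result}
    return None  # no candidate passed verification; escalate to a human
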
💡 Did You Know?

LLMs fine-tuned for code repair using execution-based feedback during training produce fixes that pass test suites at significantly higher rates than the same base models used without repair-specific tuning.

This effect can hold true even when the base model is larger, showing that targeted training on debugging and correction tasks can be more important than raw model size alone.
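
The exact training recipe behind results like this varies, but one common and simple way to fold execution feedback into training is a rejection-sampling loop: sample candidate fixes from the current model, keep only the ones that pass the tests, and fine-tune on those verified repairs. The sketch below is illustrative only; generate_candidates (assumed to return repaired source text), run_tests, and fine_tune are assumed helpers.

# Minimal sketch of rejection-sampling style fine-tuning with execution feedback.
# generate_candidates(), run_tests(), and fine_tune() are assumed helpers.
def execution_filtered_round(model, examples, num_candidates=5):
    accepted = []
    for ex in examples:
        for candidate in generate_candidates(model, ex, n=num_candidates):
            if run_tests(candidate, ex["test_suite"]).all_passed:
                accepted.append({**ex, "fixed_code": candidate})
                break  # one verified fix per example is enough
    return fine_tune(model, accepted)  # train only on verified repairs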

Why Repair-Specific Training Is More Effective

General-purpose code models see repair examples during pre-training as part of the broader code corpus, but they see them mixed with generation examples, documentation, and other code-related text. The model learns that repair is one of many things code can be about, rather than developing the specific reasoning pattern that repair requires.

Fine-tuning on repair-specific data reorganizes what the model has learned around the repair task. The input structure, in which fault context followed by buggy code consistently precedes a targeted correction, is what the model internalizes, and it applies that relationship reliably at inference time. The result is a model that approaches a piece of broken code as a repair problem rather than a generation problem.

Fault Context as a Crucial Enabler

The difference between a model that repairs and a model that rewrites is the fault context. A model that receives only the buggy code must infer what is wrong from the code itself. This is possible for simple, obvious faults, but fails for faults that depend on the runtime behavior, the test expectations, or the user-reported symptoms.

A model that receives the error message, the failing test, and the stack trace alongside the buggy code has the information it needs to reason specifically about the fault rather than about the code in general. This reasoning specificity is what produces targeted repairs rather than broad rewrites.

Fault context is the most important structural element in a repair training dataset. Collecting it, cleaning it, and including it consistently in both training and inference inputs is the investment that most directly improves repair model quality.
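
One practical way to enforce that consistency is a single formatting function used verbatim by both the training-data builder and the inference pipeline, so the model never sees a prompt shape at inference that it did not see during training. A minimal sketch, with field names mirroring the training example earlier; missing fields are simply omitted rather than padded with placeholders:

# Minimal sketch: one prompt formatter shared by training and inference,
# so fault context always appears in the same order and layout.
def format_fault_context(fault_context, buggy_code):
    parts = ["Fix this bug:"]
    for label, key in [("Error", "error_message"),
                       ("Stack trace", "stack_trace"),
                       ("Failing test", "failing_test")]:
        value = fault_context.get(key)
        if value:  # omit missing fields rather than inserting placeholders
            parts.append(f"{label}: {value}")
    parts.append("\nCode:\n" + buggy_code)
    return "\n".join(parts)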

If you want to go deeper into building and applying AI models like these, do not miss the chance to enroll in HCL GUVI's Intel & IITM Pravartak Certified Artificial Intelligence & Machine Learning course. Endorsed with Intel certification, this course adds a globally recognized credential to your resume, a powerful edge that sets you apart in the competitive AI job market.

Conclusion

Building LLMs for code repair requires treating repair as a distinct task with its own data requirements, training objectives, and evaluation methodology. The models that perform best are fine-tuned on paired datasets of buggy and fixed code with rich fault context, evaluated using execution-based metrics that verify correctness rather than surface similarity, and deployed in pipelines that generate and rank multiple candidates before verification.

Through the combination of repair-specific training data, fault context as a first-class input, and execution-based evaluation as the primary quality gate, LLMs for code repair produce verified fixes rather than plausible suggestions. If a repair model cannot verify its own output through execution, it is a generation model being used for repair and will produce the unreliable results that come with that mismatch. Reliable automated repair starts when the model is trained for repair and the pipeline is built to confirm it.

FAQs

1. What are LLMs for code repair?

They are language models trained specifically to identify faults in existing code and generate corrected versions, distinct from general-purpose code generation models that produce new code from descriptions.

2. How is code repair different from code generation?

Generation starts from a specification and produces code. Repair starts from broken code, identifies a fault, and produces a targeted modification that resolves the fault without altering correct logic elsewhere.

3. What data is needed to train a repair model?

A high-quality repair dataset contains paired examples of buggy and fixed code alongside fault context such as error messages, failing tests, and stack traces. The fault context is the most important element for producing targeted repairs rather than broad rewrites.

4. How should repair models be evaluated?

Repair models must be evaluated using execution-based metrics that run the repaired code and verify that it passes the test suite. Surface-level similarity metrics like the BLEU score measure proximity to a reference patch but do not verify that the generated fix actually works.

5. Is it better to fine-tune a model or prompt a general model?

Fine-tuning on repair-specific data consistently outperforms prompting a general model for repair tasks. Fine-tuned models internalize the repair reasoning pattern and apply it reliably, while prompted models approach repair as one instance of general code generation.


6. Can repair models fully replace human code review?

No. Repair models perform reliably on fault types well-represented in their training data but require human review for security-critical code paths, untested logic, and domain-specific faults outside their training distribution.
