Can Claude Code Really Understand COBOL Applications?

Following Anthropic's market-moving blog post, Omer Rosenbaum, CTO and co-founder at Swimm, shares the results of his team's testing

TechChannel AI

When Anthropic published its blog post on using Claude Code for COBOL understanding—a post that wiped $40 billion off IBM’s market cap—it raised a question worth answering with data: how well do tools based on large language models (LLMs) actually handle COBOL application understanding?

It’s an important question. Enterprise modernization depends on capturing every business rule, every conditional branch, and every calculation from legacy code—completely and accurately. Getting most of it right is not sufficient when the gap between “most” and “all” determines whether payments are correct or claims are approved.

Our team works on legacy application understanding for enterprise modernization. We ran Claude Code (Opus 4.6) on real COBOL programs—not toy examples, but government Medicare payment systems—to measure coverage, accuracy and consistency against the requirements that actual migrations demand.

What Modernization Actually Demands

When an enterprise modernizes a mainframe application, it must understand all of the application’s functionality, regardless of the type of modernization. Business rules, edge cases, conditional logic and calculations determine whether a payment is correct or a claim is approved.

This sets three requirements for any tool claiming to understand legacy code:

  1. Complete coverage – Every business rule, every flow, every algorithm across an entire application must be captured. A rule that isn’t captured can’t be migrated.
  2. High accuracy – Confidently wrong analysis of an application can be worse than none at all when it is relied on for critical business decisions. 
  3. Consistency – The same source code should produce the same analysis every time it’s run. If two independent extractions of the same program return materially different results, you cannot trust any output.

In regulated industries, the stakes extend beyond defects. Inaccurate understanding produces a gap between what the modernized system does and what it’s legally required to do. In banking, that means capital calculation errors with regulatory consequences. In healthcare, it means incorrect Medicare billing under the False Claims Act, with civil penalties and personal liability.

What We Found

We tested two CMS Medicare COBOL programs: a 4,692-line hospice payment pricer and an 18,835-line outpatient payment pricer. The full methodology is below, but here’s what Claude Code produced.

1. Coverage: Entire Subsystems Were Missed

Across three runs, Claude mentioned 69-86% of COBOL paragraphs on the smaller program and just 24-35% on the larger one. And “mentioned” is the most generous measure of coverage—it only means the paragraph name appeared somewhere in the output.

Real coverage has three dimensions:

  • Paragraph reachability – Is the paragraph mentioned at all?
  • Rule completeness – Are the business rules inside that paragraph fully extracted, or compressed into a vague sentence?
  • Execution context – Does the analysis capture how and when the paragraph actually runs? In COBOL, a paragraph can be reached via normal calls, GO TO jumps or as part of a PERFORM THRU range, each with different behavioral implications.

Many paragraphs in Claude’s output satisfied only the first dimension. On the larger program, the main payment calculation paragraph—70 lines of code that determines whether a claim gets paid, caps the patient’s coinsurance and handles claim reversals—had zero extracted rules across all three runs. That subsystem simply wasn’t there.

2. Accuracy: What ‘Covered’ Can Hide

We audited 280 business rules from the best Claude run on the smaller program. The problems were significant:

Twenty rules had meaningless descriptions. One paragraph was described as “Sum, move, zero.” That names the category of operations, but without the source code in hand there is no way to know what is being summed, moved or reset.

Seven rules were misattributed. One run stated that a feature was introduced in FY2016. It was introduced in FY2019. A developer using this output would look in the wrong section of code.

Eleven rules had dropped conditions. Consider this code:

140600     IF BILL-REV2 = '0652'
140700        IF BILL-UNITS2 > 0
140800           IF BILL-UNITS2 < 32
140900             COMPUTE WRK-PAY-RATE2 ROUNDED =
141000               ((98.19 * BILL-BENE-WAGE-INDEX) + 44.72)

Claude described this computation as happening when BILL-REV2 = '0652' and BILL-UNITS2 < 32, dropping the BILL-UNITS2 > 0 condition entirely. The extracted version implies the payment applies to any claim with fewer than 32 units, including zero-unit claims. The actual rule applies only when the claim has units. A migration built on this extraction would produce overpayments on zero-unit claims, systematically, on every affected claim, until discovered.
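The effect of the dropped guard can be shown in a minimal Python sketch. This is an illustration, not the pricer’s implementation: the function names are invented, and the COMPUTE is simplified to the excerpt shown above.

```python
def pay_rate(bill_rev2, bill_units2, wage_index):
    """Faithful rule: the rate applies only when the claim has units (> 0)."""
    if bill_rev2 == '0652' and 0 < bill_units2 < 32:
        return round((98.19 * wage_index) + 44.72, 2)
    return None

def pay_rate_extracted(bill_rev2, bill_units2, wage_index):
    """Claude's extracted rule: the 'units > 0' guard is missing."""
    if bill_rev2 == '0652' and bill_units2 < 32:
        return round((98.19 * wage_index) + 44.72, 2)
    return None

# A zero-unit claim: the faithful rule pays nothing, the extracted rule pays.
print(pay_rate('0652', 0, 1.0))            # None
print(pay_rate_extracted('0652', 0, 1.0))  # 142.91
```

Both functions agree on ordinary claims; the divergence appears only at the boundary the extraction silently erased, which is exactly why a reviewer skimming sample outputs would not notice it.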

Beyond incomplete descriptions, some were fundamentally wrong. On the larger program, Claude extracted this rule for a PTCA processing paragraph:

If PTCA line flag is set → Set claim flag, load PTCA day table

What the code actually does: For each PTCA line, search the PTCA day table for the line’s date of service. If not found, create a new day entry for that date. If found, update the existing entry for that day.

These are not close. The first implies a flag check and a table load. The actual logic is an iterative table search with two separate outcomes depending on whether a matching date exists. A migration based on the extracted version would fail on any claim with multiple PTCA lines on different service dates—and produce wrong results silently.
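The gap between the two behaviors can be sketched in a few lines of Python. The field names and the per-day update are illustrative assumptions; the source describes the search-or-create pattern but not the exact contents of a day entry.

```python
def load_ptca_days(claim_lines):
    """Actual logic (as described): for each PTCA line, search the day table
    by service date; create an entry if absent, update it if present."""
    day_table = {}  # service date -> day entry
    for line in claim_lines:
        if not line.get("ptca_flag"):
            continue
        date = line["service_date"]
        if date in day_table:
            day_table[date]["units"] += line["units"]   # update existing entry
        else:
            day_table[date] = {"units": line["units"]}  # create new day entry
    return day_table

# Two PTCA lines on different service dates yield two day entries --
# behavior the extracted "set flag, load table" rule cannot reproduce.
lines = [
    {"ptca_flag": True, "service_date": "2024-01-02", "units": 1},
    {"ptca_flag": True, "service_date": "2024-01-03", "units": 2},
]
print(len(load_ptca_days(lines)))  # 2
```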

3. Consistency: Same Code, Different Results

Across three runs on the larger program, the number of extracted business rules ranged from 140 to 226—a 42% spread on the same source code. The run that produced the most rules also produced the most errors, including nine wrong regulatory dollar amounts:

Constant                    Correct Value    Claude’s Value
FY2016 IP Limit             $1,288           $1,300
FY2017 IP Limit             $1,316           $1,340
FY2015 Outlier Threshold    $2,775           $2,900
FY2017 Outlier Threshold    $3,825           $3,325

These are not rounding errors. They are plausible-looking numbers that a reviewer might not catch without checking the source code directly.
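A mechanical cross-check against a curated ground-truth table would catch these, but only for constants someone already knew to verify. The sketch below uses the four values from the table; the dictionary structure is an illustration, not a feature of any existing tool.

```python
# Ground truth from the COBOL source vs. values in the LLM output (table above).
correct = {
    "FY2016 IP Limit": 1288,
    "FY2017 IP Limit": 1316,
    "FY2015 Outlier Threshold": 2775,
    "FY2017 Outlier Threshold": 3825,
}
extracted = {
    "FY2016 IP Limit": 1300,
    "FY2017 IP Limit": 1340,
    "FY2015 Outlier Threshold": 2900,
    "FY2017 Outlier Threshold": 3325,
}

# Pair up (correct, extracted) wherever the two disagree.
mismatches = {k: (v, extracted[k]) for k, v in correct.items() if extracted.get(k) != v}
print(len(mismatches))  # 4 -- every one of these plausible-looking values is wrong
```

The check is trivial once the ground truth exists; the problem is that building that ground truth means reading the source, which is the work the tool was supposed to replace.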

This Is an Architectural Problem

The natural responses to these issues are to improve the prompt or to validate the output. Neither addresses the root problem.

The prompt was refined across multiple iterations. The gaps appeared anyway—better prompting cannot fix a structural limitation.

The structural limitation is this: An LLM agent reads a codebase in passes, deciding at each step what to read next based on what it already knows. There is no mechanism that guarantees every paragraph gets visited, that live execution paths are distinguished from dead code, or that the same decisions are made across independent runs. A model that navigates a codebase cannot know what it hasn’t seen.
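By contrast, a deterministic analyzer walks the paragraph graph exhaustively. The minimal sketch below uses a hypothetical call graph; a real tool would build it from parsed PERFORM and GO TO statements, but the guarantee is the same: every edge is followed, so coverage is exhaustive by construction and identical on every run.

```python
from collections import deque

def reachable_paragraphs(call_graph, entry):
    """Breadth-first traversal of a paragraph call graph. Deterministic:
    the same graph and entry point always yield the same set."""
    seen, queue = {entry}, deque([entry])
    while queue:
        for target in call_graph.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical graph: edges are PERFORM/GO TO targets extracted by a parser.
graph = {
    "0000-MAIN": ["1000-EDIT", "2000-CALC"],
    "2000-CALC": ["2100-OUTLIER"],
    "9000-DEAD": [],  # defined but never referenced
}
live = reachable_paragraphs(graph, "0000-MAIN")
print(sorted(set(graph) - live))  # ['9000-DEAD'] -- dead code is flagged, not guessed at
```

The inverse query is the important one: any paragraph outside the reachable set is reported as unvisited, which is precisely the signal an LLM agent navigating file by file cannot produce.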

On validation: you cannot audit what you don’t know is missing. The main payment calculation paragraph produced zero rules in all three runs – a reviewer would have no indication that an entire subsystem was absent.

How We Tested

The two programs are real, publicly available CMS Medicare COBOL programs used by the government to calculate Medicare payments. CMS released COBOL source code for its Pricer applications until January 2022, when it completed the conversion to Java; the COBOL versions remain available in the CMS PC Pricer historical archive.

  • HOSPR210 – The Hospice PPS Pricer, which calculates payments for hospice claims (4,692 lines of code)
  • OPPSCAL – The Outpatient PPS Pricer, which calculates payments for outpatient claims (18,835 lines of code)

These programs are representative of real enterprise COBOL. They contain complex control flow, GO TO statements that jump between sections non-linearly, copybooks with rate tables and wage index calculations, and business logic layered across fiscal years. They are not toy examples.

We explicitly removed the extensive inline documentation included in the open-source version. Real enterprise COBOL doesn’t come with explanations attached. We did not further alter the programs in any way.

We ran Claude Code three times on each program—independently, with no shared context between runs. We used a refined ~800-word prompt with a four-pass extraction strategy specifically designed for COBOL understanding. Claude Code used approximately 70 tool calls per run.

What Enterprises Should Look For

These findings are specific to one use case: understanding legacy COBOL for enterprise modernization. Here, the requirements—complete coverage, exact accuracy, run-to-run consistency—exceed what probabilistic tools can guarantee.

A migration that carries dropped conditions, compressed algorithms or wrong regulatory constants will produce incorrect behavior in production. For enterprises in banking, healthcare or insurance, the consequences extend well beyond a defect ticket.

The alternative is deterministic analysis—tools that parse the full program structure by construction rather than by navigation. When AI is used, it should translate deterministic findings into human-readable form, not infer them.

The COBOL versions of these programs are available in the CMS PC Pricer historical archive. Before committing to an approach for a real modernization program, run your actual COBOL through both a deterministic and an LLM-based tool and compare the results. The gap will speak for itself.
