
AI Agents for COBOL Flat File Data Validation and Cleansing


Legacy COBOL systems process millions of transactions nightly, dumping the results into Fixed-Width Text Files (often called “Flat Files”). Unlike modern JSON or CSV, these files have no delimiters. A customer name isn’t “Column B”—it is exactly bytes 10 through 40 of a row. If a single byte is off, the entire dataset breaks.
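To make that concrete, here is a minimal illustration. The 50-byte record layout is invented, but the slicing shows exactly how fixed-width addressing works:

# A 50-byte record: id in bytes 1-9, name in bytes 10-40, balance in bytes 41-50
record = "000042170JOHNSON, ALFRED                0000012550"

customer_id = record[0:9]    # bytes 1-9
name        = record[9:40]   # bytes 10-40, padded with spaces on the right
balance     = record[40:50]  # bytes 41-50, zero-filled cents

print(name.rstrip())   # JOHNSON, ALFRED
print(int(balance))    # 12550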

For modern AI Agents, these files are opaque blobs. Without a strict schema validator, an agent trying to read a flat file will hallucinate relationships that don’t exist.

We will build a FastMCP Server that acts as a Validation Gateway. Instead of your agent trying to parse raw text, it hands the file and a “Copybook” (schema definition) to this tool. The tool returns structured, validated JSON or specific error reports pinpointing exactly which row and byte failed. The server combines three pieces:

  • A Fixed-Width Parser: Uses pandas to strictly enforce byte-level column definitions.
  • Schema Validator: Checks if numeric fields actually contain numbers and if dates match COBOL formats (e.g., YYMMDD).
  • Data Cleanser: Trims the aggressive whitespace padding typical of mainframe exports.

This MCP server provides two tools: parse_and_validate, which checks data integrity and converts the legacy format into agent-readable JSON in a single pass, and generate_schema_template, which returns a blank layout for the agent to populate.

import pandas as pd
import io
import json
from datetime import datetime
from typing import List, Dict, Any
from fastmcp import FastMCP
from pydantic import BaseModel, Field

# Initialize the FastMCP server
mcp = FastMCP("COBOL Flat File Validator")


class ColumnSpec(BaseModel):
    name: str = Field(..., description="The name of the field (e.g., 'CUSTOMER_ID')")
    length: int = Field(..., description="The fixed width length of the field in characters")
    dtype: str = Field(..., description="Expected data type: 'str', 'int', 'float', or 'date' (YYMMDD)")


@mcp.tool()
def parse_and_validate(file_content: str, schema: List[Dict[str, Any]]) -> str:
    """
    Parses COBOL fixed-width flat file content based on a provided schema (list of column specs).
    Returns a JSON string containing valid records and a report of any validation errors.

    Args:
        file_content: The raw string content of the flat file.
        schema: A list of dicts, each containing 'name', 'length', and 'dtype'.
                Example: [{"name": "ID", "length": 5, "dtype": "int"}, ...]
    """
    try:
        # Convert schema dicts to ColumnSpec objects for validation
        specs = [ColumnSpec(**s) for s in schema]

        # Calculate byte ranges for pandas read_fwf
        col_specs = []
        current_pos = 0
        for spec in specs:
            col_specs.append((current_pos, current_pos + spec.length))
            current_pos += spec.length
        names = [s.name for s in specs]
        byte_ranges = dict(zip(names, col_specs))

        # Read everything as strings first; we validate manually below so we
        # can report exactly which row and byte range failed.
        df = pd.read_fwf(
            io.StringIO(file_content),
            colspecs=col_specs,
            header=None,
            names=names,
            dtype=str,  # prevent pandas auto-conversion errors
            keep_default_na=False
        )

        validation_report = {
            "total_rows": len(df),
            "valid_rows": 0,
            "errors": [],
            "data": []
        }
        valid_data = []

        for index, row in df.iterrows():
            row_errors = []
            cleaned_row = {}
            for spec in specs:
                raw_val = row[spec.name]
                start, end = byte_ranges[spec.name]  # 0-based start, end-exclusive
                # COBOL CLEANSING: strip the space padding mainframes add
                clean_val = raw_val.strip() if raw_val else ""

                # VALIDATION LOGIC
                if spec.dtype == 'int':
                    if not clean_val.isdigit():
                        if clean_val == "":
                            # COBOL often zero-fills numerics; treat blank as 0
                            clean_val = 0
                        else:
                            row_errors.append(
                                f"Field '{spec.name}' (bytes {start + 1}-{end}) expected INT, got '{raw_val}'"
                            )
                    else:
                        clean_val = int(clean_val)
                elif spec.dtype == 'float':
                    try:
                        clean_val = float(clean_val)
                    except ValueError:
                        row_errors.append(
                            f"Field '{spec.name}' (bytes {start + 1}-{end}) expected FLOAT, got '{raw_val}'"
                        )
                elif spec.dtype == 'date':
                    try:
                        datetime.strptime(clean_val, "%y%m%d")  # COBOL YYMMDD
                    except ValueError:
                        row_errors.append(
                            f"Field '{spec.name}' (bytes {start + 1}-{end}) expected YYMMDD date, got '{raw_val}'"
                        )
                cleaned_row[spec.name] = clean_val

            if row_errors:
                validation_report["errors"].append({
                    "row_index": int(index),  # cast: numpy ints are not JSON-serializable
                    "issues": row_errors
                })
            else:
                valid_data.append(cleaned_row)

        validation_report["valid_rows"] = len(valid_data)
        validation_report["data"] = valid_data
        return json.dumps(validation_report, indent=2)
    except Exception as e:
        return json.dumps({"fatal_error": str(e)})


@mcp.tool()
def generate_schema_template() -> str:
    """Returns a template JSON schema for the user to fill out."""
    template = [
        {"name": "RECORD_TYPE", "length": 2, "dtype": "str"},
        {"name": "ACCOUNT_NUM", "length": 10, "dtype": "int"},
        {"name": "AMOUNT", "length": 12, "dtype": "float"},
        {"name": "DESCRIPTION", "length": 30, "dtype": "str"}
    ]
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    # Defaults to the stdio transport; for HTTP hosting (as in the Dockerfile
    # below), pass FastMCP's network transport options to mcp.run() instead.
    mcp.run()
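
Before containerizing, you can smoke-test the tool in-process. This sketch assumes FastMCP 2.x's in-memory Client (the exact return shape of call_tool varies between versions), an invented two-row sample file, and the server code above saved as server.py:

import asyncio
from fastmcp import Client
from server import mcp  # the server defined above

# Two sample records: 5-byte ID, 12-byte AMOUNT, 20-byte DESCRIPTION.
# Row 0 is clean; row 1 has letters inside the AMOUNT field.
SAMPLE = (
    "00001000000123.45WIDGET RESTOCK      \n"
    "00002000000ABC.99FREIGHT CHARGE      \n"
)
SCHEMA = [
    {"name": "ID", "length": 5, "dtype": "int"},
    {"name": "AMOUNT", "length": 12, "dtype": "float"},
    {"name": "DESCRIPTION", "length": 20, "dtype": "str"},
]

async def main():
    # FastMCP clients can connect to a server object in-process
    async with Client(mcp) as client:
        result = await client.call_tool(
            "parse_and_validate",
            {"file_content": SAMPLE, "schema": SCHEMA},
        )
        print(result)  # expect valid_rows: 1 and one AMOUNT error

asyncio.run(main())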

This Dockerfile ensures all data processing dependencies are installed and the server is exposed on port 8000 for Railway or similar cloud hosting.

# Use a slim Python image to keep the container lightweight
FROM python:3.11-slim
# Set the working directory
WORKDIR /app
# Install system dependencies if needed (usually not for pure pandas/mcp)
# RUN apt-get update && apt-get install -y gcc
# Install Python dependencies
# pandas for data processing, fastmcp for the server
RUN pip install pandas fastmcp pydantic
# Copy the server code
COPY server.py .
# EXPOSE the port for the MCP server
EXPOSE 8000
# Run the server
CMD ["python", "server.py"]

Once deployed, your AI Agent (CrewAI, LangGraph, or OpenAI) can use this tool to safely ingest legacy data.

The agent doesn’t know the file structure yet. It asks the user for the “Copybook” or “File Layout”.

  • Agent Prompt: “I see you uploaded SALES_2024.TXT. Please provide the column layout (Field Name, Length, Type).”
  • User: “ID is 5 chars, Date is 8 chars, Amount is 10 chars.”

The agent constructs the schema list based on the user’s description.

  • Agent Action: Calls generate_schema_template to see the format, then populates it.
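
For the layout the user described above (ID 5, Date 8, Amount 10), the populated schema might look like the following. The dtype values are the agent's inference from the field names; the 8-character date is kept as a string, since the validator's date check targets 6-character YYMMDD:

[
  {"name": "ID", "length": 5, "dtype": "int"},
  {"name": "DATE", "length": 8, "dtype": "str"},
  {"name": "AMOUNT", "length": 10, "dtype": "float"}
]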

The agent passes the file content and the schema to parse_and_validate; a sketch of handling the returned report follows the scenarios below.

  • Scenario A (Success): The tool returns {"valid_rows": 500, "errors": []}. The agent proceeds to analyze the clean JSON data.
  • Scenario B (Failure): The tool returns {"errors": [{"row_index": 4, "issues": ["Field 'AMOUNT' expected FLOAT..."]}]}.
  • Agent Response: “It looks like Row 4 has corrupted data in the Amount field. Should I skip this row or would you like to correct the file?”
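
A minimal sketch of that branching on the agent side; the report fields match parse_and_validate above, while the surrounding agent-framework glue (how the question actually reaches the user) is assumed:

import json

def handle_report(report_json: str) -> str:
    report = json.loads(report_json)
    if "fatal_error" in report:
        return f"The file could not be parsed at all: {report['fatal_error']}"
    if not report["errors"]:
        # Scenario A: every row passed, hand the clean data to analysis
        return f"All {report['valid_rows']} rows validated. Proceeding with analysis."
    # Scenario B: surface the first failure and ask the user how to proceed
    first = report["errors"][0]
    return (
        f"Row {first['row_index']} has corrupted data: {first['issues'][0]}. "
        "Should I skip this row, or would you like to correct the file?"
    )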

This approach prevents the “Garbage In, Garbage Out” problem that plagues AI when dealing with strict legacy formats.


  • Status: ✅ Verified
  • Environment: Python 3.11
  • Auditor: AgentRetrofit CI/CD
