
AI Agents for COBOL Flat File Data Validation and Cleansing


Legacy COBOL systems process millions of transactions nightly, dumping the results into Fixed-Width Text Files (often called “Flat Files”). Unlike modern JSON or CSV, these files have no delimiters. A customer name isn’t “Column B”—it is exactly bytes 10 through 40 of a row. If a single byte is off, the entire dataset breaks.
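To make that concrete, here is a minimal illustration. The 50-byte record layout is invented, but the slicing shows exactly how fixed-width addressing works:

# A 50-byte record: id in bytes 1-9, name in bytes 10-40, balance in bytes 41-50
record = "000042170JOHNSON, ALFRED                0000012550"

customer_id = record[0:9]    # bytes 1-9
name        = record[9:40]   # bytes 10-40, padded with spaces on the right
balance     = record[40:50]  # bytes 41-50, zero-filled cents

print(name.rstrip())   # JOHNSON, ALFRED
print(int(balance))    # 12550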

For modern AI Agents, these files are opaque blobs. Without a strict schema validator, an agent trying to read a flat file will hallucinate relationships that don’t exist.

We will build a FastMCP Server that acts as a Validation Gateway. Instead of your agent trying to parse raw text, it hands the file and a “Copybook” (schema definition) to this tool. The tool returns structured, validated JSON or specific error reports pinpointing exactly which row and byte failed. The server combines three pieces:

  • A Fixed-Width Parser: Uses pandas to strictly enforce byte-level column definitions.
  • Schema Validator: Checks if numeric fields actually contain numbers and if dates match COBOL formats (e.g., YYMMDD).
  • Data Cleanser: Trims the aggressive whitespace padding typical of mainframe exports.

This MCP server provides two tools: parse_and_validate, which checks data integrity and converts the legacy format into agent-readable JSON in a single pass, and generate_schema_template, which returns a blank layout for the agent to populate.

import pandas as pd
import io
import json
from datetime import datetime
from typing import List, Dict, Any
from fastmcp import FastMCP
from pydantic import BaseModel, Field

# Initialize the FastMCP server
mcp = FastMCP("COBOL Flat File Validator")


class ColumnSpec(BaseModel):
    name: str = Field(..., description="The name of the field (e.g., 'CUSTOMER_ID')")
    length: int = Field(..., description="The fixed width length of the field in characters")
    dtype: str = Field(..., description="Expected data type: 'str', 'int', 'float', or 'date' (YYMMDD)")


@mcp.tool()
def parse_and_validate(file_content: str, schema: List[Dict[str, Any]]) -> str:
    """
    Parses COBOL fixed-width flat file content based on a provided schema (list of column specs).
    Returns a JSON string containing valid records and a report of any validation errors.

    Args:
        file_content: The raw string content of the flat file.
        schema: A list of dicts, each containing 'name', 'length', and 'dtype'.
                Example: [{"name": "ID", "length": 5, "dtype": "int"}, ...]
    """
    try:
        # Convert schema dicts to ColumnSpec objects for validation
        specs = [ColumnSpec(**s) for s in schema]

        # Calculate byte ranges for pandas read_fwf
        col_specs = []
        current_pos = 0
        for spec in specs:
            col_specs.append((current_pos, current_pos + spec.length))
            current_pos += spec.length
        names = [s.name for s in specs]
        byte_ranges = dict(zip(names, col_specs))

        # Read everything as strings first; we validate manually below so we
        # can report exactly which row and byte range failed.
        df = pd.read_fwf(
            io.StringIO(file_content),
            colspecs=col_specs,
            header=None,
            names=names,
            dtype=str,  # prevent pandas auto-conversion errors
            keep_default_na=False
        )

        validation_report = {
            "total_rows": len(df),
            "valid_rows": 0,
            "errors": [],
            "data": []
        }
        valid_data = []

        for index, row in df.iterrows():
            row_errors = []
            cleaned_row = {}
            for spec in specs:
                raw_val = row[spec.name]
                start, end = byte_ranges[spec.name]  # 0-based start, end-exclusive
                # COBOL CLEANSING: strip the space padding mainframes add
                clean_val = raw_val.strip() if raw_val else ""

                # VALIDATION LOGIC
                if spec.dtype == 'int':
                    if not clean_val.isdigit():
                        if clean_val == "":
                            # COBOL often zero-fills numerics; treat blank as 0
                            clean_val = 0
                        else:
                            row_errors.append(
                                f"Field '{spec.name}' (bytes {start + 1}-{end}) expected INT, got '{raw_val}'"
                            )
                    else:
                        clean_val = int(clean_val)
                elif spec.dtype == 'float':
                    try:
                        clean_val = float(clean_val)
                    except ValueError:
                        row_errors.append(
                            f"Field '{spec.name}' (bytes {start + 1}-{end}) expected FLOAT, got '{raw_val}'"
                        )
                elif spec.dtype == 'date':
                    try:
                        datetime.strptime(clean_val, "%y%m%d")  # COBOL YYMMDD
                    except ValueError:
                        row_errors.append(
                            f"Field '{spec.name}' (bytes {start + 1}-{end}) expected YYMMDD date, got '{raw_val}'"
                        )
                cleaned_row[spec.name] = clean_val

            if row_errors:
                validation_report["errors"].append({
                    "row_index": int(index),  # cast: numpy ints are not JSON-serializable
                    "issues": row_errors
                })
            else:
                valid_data.append(cleaned_row)

        validation_report["valid_rows"] = len(valid_data)
        validation_report["data"] = valid_data
        return json.dumps(validation_report, indent=2)
    except Exception as e:
        return json.dumps({"fatal_error": str(e)})


@mcp.tool()
def generate_schema_template() -> str:
    """Returns a template JSON schema for the user to fill out."""
    template = [
        {"name": "RECORD_TYPE", "length": 2, "dtype": "str"},
        {"name": "ACCOUNT_NUM", "length": 10, "dtype": "int"},
        {"name": "AMOUNT", "length": 12, "dtype": "float"},
        {"name": "DESCRIPTION", "length": 30, "dtype": "str"}
    ]
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    # Defaults to the stdio transport; for HTTP hosting (as in the Dockerfile
    # below), pass FastMCP's network transport options to mcp.run() instead.
    mcp.run()
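
Before containerizing, you can smoke-test the tool in-process. This sketch assumes FastMCP 2.x's in-memory Client (the exact return shape of call_tool varies between versions), an invented two-row sample file, and the server code above saved as server.py:

import asyncio
from fastmcp import Client
from server import mcp  # the server defined above

# Two sample records: 5-byte ID, 12-byte AMOUNT, 20-byte DESCRIPTION.
# Row 0 is clean; row 1 has letters inside the AMOUNT field.
SAMPLE = (
    "00001000000123.45WIDGET RESTOCK      \n"
    "00002000000ABC.99FREIGHT CHARGE      \n"
)
SCHEMA = [
    {"name": "ID", "length": 5, "dtype": "int"},
    {"name": "AMOUNT", "length": 12, "dtype": "float"},
    {"name": "DESCRIPTION", "length": 20, "dtype": "str"},
]

async def main():
    # FastMCP clients can connect to a server object in-process
    async with Client(mcp) as client:
        result = await client.call_tool(
            "parse_and_validate",
            {"file_content": SAMPLE, "schema": SCHEMA},
        )
        print(result)  # expect valid_rows: 1 and one AMOUNT error

asyncio.run(main())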

This Dockerfile ensures all data processing dependencies are installed and the server is exposed on port 8000 for Railway or similar cloud hosting.

# Use a slim Python image to keep the container lightweight
FROM python:3.11-slim
# Set the working directory
WORKDIR /app
# Install system dependencies if needed (usually not for pure pandas/mcp)
# RUN apt-get update && apt-get install -y gcc
# Install Python dependencies
# pandas for data processing, fastmcp for the server
RUN pip install pandas fastmcp pydantic
# Copy the server code
COPY server.py .
# EXPOSE the port for the MCP server
EXPOSE 8000
# Run the server
CMD ["python", "server.py"]

Once deployed, your AI Agent (CrewAI, LangGraph, or OpenAI) can use this tool to safely ingest legacy data.

The agent doesn’t know the file structure yet. It asks the user for the “Copybook” or “File Layout”.

  • Agent Prompt: “I see you uploaded SALES_2024.TXT. Please provide the column layout (Field Name, Length, Type).”
  • User: “ID is 5 chars, Date is 8 chars, Amount is 10 chars.”

The agent constructs the schema list based on the user’s description.

  • Agent Action: Calls generate_schema_template to see the format, then populates it.
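
For the layout the user described above (ID 5, Date 8, Amount 10), the populated schema might look like the following. The dtype values are the agent's inference from the field names; the 8-character date is kept as a string, since the validator's date check targets 6-character YYMMDD:

[
  {"name": "ID", "length": 5, "dtype": "int"},
  {"name": "DATE", "length": 8, "dtype": "str"},
  {"name": "AMOUNT", "length": 10, "dtype": "float"}
]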

The agent passes the file content and the schema to parse_and_validate; a sketch of handling the returned report follows the scenarios below.

  • Scenario A (Success): The tool returns {"valid_rows": 500, "errors": []}. The agent proceeds to analyze the clean JSON data.
  • Scenario B (Failure): The tool returns {"errors": [{"row_index": 4, "issues": ["Field 'AMOUNT' expected FLOAT..."]}]}.
  • Agent Response: “It looks like Row 4 has corrupted data in the Amount field. Should I skip this row or would you like to correct the file?”
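
A minimal sketch of that branching on the agent side; the report fields match parse_and_validate above, while the surrounding agent-framework glue (how the question actually reaches the user) is assumed:

import json

def handle_report(report_json: str) -> str:
    report = json.loads(report_json)
    if "fatal_error" in report:
        return f"The file could not be parsed at all: {report['fatal_error']}"
    if not report["errors"]:
        # Scenario A: every row passed, hand the clean data to analysis
        return f"All {report['valid_rows']} rows validated. Proceeding with analysis."
    # Scenario B: surface the first failure and ask the user how to proceed
    first = report["errors"][0]
    return (
        f"Row {first['row_index']} has corrupted data: {first['issues'][0]}. "
        "Should I skip this row, or would you like to correct the file?"
    )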

This approach prevents the “Garbage In, Garbage Out” problem that plagues AI when dealing with strict legacy formats.


  • Status: ✅ Verified
  • Environment: Python 3.11
  • Auditor: AgentRetrofit CI/CD
