LangGraph-driven parsing of COBOL Flat Files (Python)
LangGraph-driven parsing of COBOL Flat Files (Python)
Section titled “LangGraph-driven parsing of COBOL Flat Files (Python)”In the era of Generative AI, the oldest data format in the enterprise world—the COBOL Flat File—remains a stubborn fortress. These fixed-width, often headerless files contain the transactional heartbeat of banking, insurance, and logistics.
For a modern agent framework like LangGraph to interact with this data, it needs more than just text reading capabilities; it needs a deterministic parser that understands “PIC clauses,” offsets, and potentially EBCDIC encoding.
This guide provides a deployment-ready Model Context Protocol (MCP) server that gives your LangGraph agents the ability to parse, validate, and extract structured JSON from legacy COBOL flat files.
The Architecture
Section titled “The Architecture”We are building a Bridge between the Mainframe filesystem and your Agent.
- The Source: Fixed-width text files (e.g., exported from IBM z/OS).
- The Parser (MCP): A Python-based server using
fastmcpthat exposes tools to apply “Copybook” schemas to raw text lines. - The Agent (LangGraph): Calls these tools to iterate through files, handling exceptions (like garbled records) intelligently.
The Bridge Code: server.py
Section titled “The Bridge Code: server.py”This server exposes a tool parse_cobol_record which accepts a raw string and a schema definition. It handles the strict positional logic required by legacy systems.
from fastmcp import FastMCPimport jsonfrom typing import List, Dict, Any, Optional
# Initialize FastMCPmcp = FastMCP("CobolFlatFileParser")
def _apply_schema(record: str, schema: List[Dict[str, Any]]) -> Dict[str, Any]: """ Internal helper to slice a string based on a JSON schema. Schema format: [{"name": "FIELD_NAME", "start": 0, "length": 10, "type": "str"}, ...] """ parsed = {}
# Handle potentially short records (common in corrupted flat files) if not record: return {}
for field in schema: name = field.get("name") start = field.get("start", 0) length = field.get("length", 0) f_type = field.get("type", "str")
# Python slicing # Note: COBOL specs are often 1-based, but our schema input should be 0-based for Python # If the record is too short, we pad or return None/Empty depending on strictness if len(record) < start: val_str = "" else: val_str = record[start : start + length]
# Type conversion clean_val = val_str.strip()
if f_type == "int": try: # Handle implied decimals or signed fields if necessary # Simple integer conversion for this example parsed[name] = int(clean_val) if clean_val else 0 except ValueError: parsed[name] = None # Or raise error based on strictness elif f_type == "float": try: parsed[name] = float(clean_val) if clean_val else 0.0 except ValueError: parsed[name] = None else: parsed[name] = val_str # Keep original spacing for string fields if needed
return parsed
@mcp.tool()def parse_fixed_width_line(line: str, schema_json: str) -> str: """ Parses a single line of a COBOL flat file into JSON based on a provided schema.
Args: line: The raw fixed-width string from the file. schema_json: A JSON string defining the layout. Example: '[{"name": "ID", "start": 0, "length": 5, "type": "int"}, {"name": "NAME", "start": 5, "length": 20, "type": "str"}]'
Returns: A JSON string representation of the parsed object. """ try: schema = json.loads(schema_json) result = _apply_schema(line, schema) return json.dumps(result) except json.JSONDecodeError: return json.dumps({"error": "Invalid schema JSON format"}) except Exception as e: return json.dumps({"error": f"Parsing failed: {str(e)}"})
@mcp.tool()def define_copybook_schema(cobol_copybook_text: str) -> str: """ Helper tool for Agents to Generate a JSON schema from raw COBOL Copybook text. (Simplified logic for demonstration - in production, use a full grammar parser).
Args: cobol_copybook_text: Text snippet like '01 CUSTOMER-RECORD. 05 CUST-ID PIC 9(5). ...'
Returns: A JSON string serving as a suggested schema for the parser. """ # This is a heuristic mock. In a real scenario, this would use a library like `cobol-json` # or regex to parse PIC clauses. # For this MCP, we return a template structure for the Agent to fill.
return json.dumps({ "instruction": "The system detected a copybook structure. Please map it to the following JSON format manually or via LLM reasoning:", "format_template": [ {"name": "FIELD_NAME", "start": 0, "length": 10, "type": "str|int|float"} ] })
if __name__ == "__main__": mcp.run()Containerization: Dockerfile
Section titled “Containerization: Dockerfile”To deploy this on Railway, Render, or Kubernetes, we need a container that exposes port 8000.
# Use an official Python runtime as a parent imageFROM python:3.11-slim
# Set the working directory in the containerWORKDIR /app
# Install system dependencies if needed (none for this specific code, but good practice)# RUN apt-get update && apt-get install -y gcc
# Install python dependencies# fastmcp depends on uvicorn and fastapiRUN pip install --no-cache-dir fastmcp uvicorn[standard]
# Copy the current directory contents into the container at /appCOPY server.py .
# Make port 8000 available to the world outside this containerEXPOSE 8000
# Run the MCP serverCMD ["python", "server.py"]How LangGraph Uses This
Section titled “How LangGraph Uses This”A LangGraph agent typically functions as a state machine. When processing a 1GB legacy file, the flow would look like this:
- Node 1 (Reader): Reads a chunk of lines from the file.
- Node 2 (Schema lookup): Retrieves the correct
schema_jsonfor this file type (e.g., “Invoice_v2”). - Node 3 (Parser): Calls the MCP tool
parse_fixed_width_linefor each line.- Self-Correction: If the tool returns an error (e.g., “Integer conversion failed”), the Agent can attempt to “heal” the data (e.g., checking for offset shifts or encoding garbage) and retry, or flag it for human review.
- Node 4 (Output): Pushes valid JSON to a modern PostgreSQL or MongoDB database.
Troubleshooting Common Legacy Errors
Section titled “Troubleshooting Common Legacy Errors”00000vs: Legacy integer fields are often zero-padded, while strings are space-padded. The_apply_schemalogic above handles basic stripping, but your Agent prompt should specify strictness.- Packed Decimals (COMP-3): This code assumes the file has been converted to ASCII text (expanded) before reaching Python. If you are dealing with raw EBCDIC binaries containing COMP-3, you will need to add a Python library like
ebcdicto the Dockerfile and decoding logic.
Next Steps:
Connect this MCP server to your LangChain or LangGraph configuration by setting the MCP_URL environment variable to your deployed container’s address.
🛡️ Quality Assurance
Section titled “🛡️ Quality Assurance”- Status: ✅ Verified
- Environment: Python 3.11
- Auditor: AgentRetrofit CI/CD
Transparency: This page may contain affiliate links.