Devlery

KDD Cup Data Agents Delayed by 700 Teams and Docker Audits

KDD Cup 2026 Data Agents delayed Phase 1 results after more than 700 teams and Docker compliance checks exposed the operational cost of agent evaluation.

AI Summary
  • What happened: The KDD Cup 2026 Data Agents track adjusted its Phase 1 result schedule after more than 700 teams submitted Docker-based agents.
    • The official notice says some submitted images showed suspected malicious or unauthorized behavior, requiring compliance checks and a technical audit.
  • Evaluation shape: Phase 1 splits hidden tasks into about 60 A-board tasks and about 320 B-board tasks, with 2-hour and 12-hour runtime limits.
  • Developer impact: Data-agent benchmarks now depend on Docker isolation, blocked external internet, injected model endpoints, output persistence, and audit logs.
    • The benchmark fixes the official evaluation model as Qwen3.5-35B-A3B, so architecture and orchestration are meant to matter more than private model choice.
  • Watch: Local results from Claude, GPT, Gemini, DeepSeek, or other development models should not be treated as the same thing as the official score.

The Data Agents for Complex Data Analysis track at KDD Cup 2026 has turned a familiar agent story into an infrastructure story. The official site frames the competition around autonomous AI agents that decompose complex analysis questions, work across multiple data sources, reason through several steps, and produce accurate answers. The late-May update, however, was not about a new model score or a likely winner. It was about submission volume, Docker images, compliance checks, and a technical audit. Once more than 700 teams entered, evaluation infrastructure and security review became part of the benchmark.

The official rules make clear that this is not a simple leaderboard. Teams submit one Docker image. That container must traverse task directories, read each task_<id> input, and write /output/task_<id>/prediction.csv. The evaluation target is linux/amd64. During official scoring, the system injects MODEL_API_URL and credentials that point to Qwen3.5-35B-A3B. Teams may use OpenAI, Anthropic, local models, or other systems during development, but the submitted code has to switch to the organizer-provided model endpoint during evaluation.
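
As a rough illustration of that container contract, the traversal could look like the Python sketch below. The `/input` and `/output` roots follow the paths named in the rules; `solve_task` and `run_all` are hypothetical names, the placeholder answer is invented, and the real task layout may differ from what is assumed here.

```python
import csv
from pathlib import Path


def solve_task(task_dir: Path) -> list:
    """Hypothetical placeholder: return rows (header first) for one task."""
    return [["answer"], ["unknown"]]


def run_all(input_root: Path, output_root: Path) -> list:
    """Visit each task_<id> directory and write <output>/task_<id>/prediction.csv."""
    written = []
    task_dirs = sorted(
        p for p in input_root.iterdir()
        if p.is_dir() and p.name.startswith("task_")
    )
    for task_dir in task_dirs:
        out_dir = output_root / task_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        rows = solve_task(task_dir)
        # newline="" lets the csv module control line endings itself.
        with open(out_dir / "prediction.csv", "w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows)
        written.append(task_dir.name)
    return written
```

Inside the container, an entrypoint would presumably call `run_all(Path("/input"), Path("/output"))`. The input directory is only read, never modified, which matches the rule against changing the mounted inputs.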

Figure: KDD Cup 2026 Data Agents Phase 1 evaluation structure

That design says a lot about what the benchmark is trying to measure. The question is not simply "which LLM is smarter." Under a shared model condition, the competition can expose who built the more reliable system for finding data, interpreting tables and documents, writing Python or SQL steps, recovering from intermediate failures, managing timeouts, and still leaving a valid prediction.csv before the runtime expires. Once the model is fixed, orchestration, retrieval, parsing, timeout control, and output validation become the measurable difference.

The competition defines a data agent as a holistic architecture. In that framing, the agent uses knowledge understanding, reasoning, and planning to orchestrate work across the Data+AI ecosystem. The listed capabilities are concrete: break high-level analytical questions into executable steps, select tools such as Python scripts, SQL queries, and API calls, and handle structured tables, unstructured documents, charts, and multimodal data sources. This is not a chat assistant that answers a question about a spreadsheet. The submitted container has to perform data preparation, calculation, evidence extraction, validation, and answer generation inside one runnable package.

The May 29 AoE update matters because it shows where that abstract goal becomes operating cost. The official notice says more than 700 teams joined and that the submission volume exceeded expectations, putting substantial load on compute resources and scheduling. It separately says suspected malicious or unauthorized behavior was observed in some submitted Docker images and that those images disrupted evaluation jobs. The organizers then adjusted evaluation rules and the remaining schedule to keep the process stable, fair, secure, and sustainable.

Those sentences compress the security model of agent evaluation. A participant's Docker image is the answer-generating program, but it is also the attack surface. The rules prohibit external internet access during evaluation, bypassing the injected MODEL_API_URL, using another LLM service as the primary task-solving model, container escape, infrastructure probing, changes to the /input mount directory, environment-variable destruction, and shared-image submissions across teams. To evaluate complex analysis ability, the container has to be powerful enough to run real work. To preserve fairness, it also has to be constrained enough that it cannot call hidden resources, leak tasks, or interfere with the evaluation system.

The A-board and B-board split is one way to manage that cost. The Phase 1 hidden set contains about 60 A-board tasks and about 320 B-board tasks. The A-board exists for staged leaderboard feedback and has a 2-hour single-run limit. The B-board drives the final Phase 1 evaluation and has a 12-hour single-run limit. The final Phase 1 leaderboard combines weighted A-board and B-board scores. The top 60 teams advance to Phase 2. Teams ranked 1 through 40 may choose the Leaderboard or Creative subtrack, while teams ranked 41 through 60 enter only the Creative subtrack.

The shape resembles coding benchmarks such as SWE-bench, but data analysis adds different failure modes. Coding agents usually run tests and submit a patch. KDD Cup Data Agents does not guarantee one fixed input structure. The FAQ says the context/ directory can vary by task and may include combinations such as csv/, db/, json/, doc/, and knowledge.md. A single container loops through all tasks. If one task consumes the full timeout or crashes the process, later outputs can be lost. That is why the FAQ recommends task-level timeout control and writing prediction.csv immediately after each task is processed.
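
One way to follow the FAQ's advice is to give every task its own child process and time budget, and to make sure a prediction.csv exists before the real work starts. The sketch below is an assumption about how that might be done, not the organizers' reference code: `run_task`, `write_prediction`, the fallback rows, and the worker command are all hypothetical.

```python
import csv
import subprocess
from pathlib import Path

# Hypothetical placeholder answer, so a task that fails still leaves a file.
FALLBACK_ROWS = [["answer"], [""]]


def write_prediction(out_dir: Path, rows) -> None:
    """Write prediction.csv via a temp file so readers never see a partial file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tmp = out_dir / "prediction.csv.tmp"
    with open(tmp, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
    tmp.replace(out_dir / "prediction.csv")


def run_task(cmd: list, out_dir: Path, budget_s: float) -> str:
    """Run one task in a child process with its own time budget.

    The child is expected to overwrite prediction.csv when it succeeds.
    Returns "ok", "crash", or "timeout".
    """
    write_prediction(out_dir, FALLBACK_ROWS)  # persist something first
    try:
        # subprocess.run kills the child when the timeout expires.
        proc = subprocess.run(cmd, timeout=budget_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "crash"
```

Because each task runs in a separate process, a crash or a runaway loop in one task costs only that task's budget, and the outer loop can move on to the next directory.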

The scoring method also fits the data-agent target. The Leaderboard track uses column-level matching. According to the official description, column values are sorted and compared through signature counts, with recall-based scoring and a light redundancy penalty. Column names and row order do not participate in scoring. That choice rewards the actual value vector rather than a polished table label. For data agents, a plausible natural-language explanation is less important than a reproducible calculation result.
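
The official formula is not reproduced here, so the following Python sketch only illustrates the general idea of order-independent column matching. The signature definition, the `penalty` weight of 0.1, and the function names are assumptions made for illustration and should not be read as the competition's actual scorer.

```python
from collections import Counter


def column_signatures(rows: list) -> Counter:
    """Count columns by an order-independent signature of their values.

    `rows` holds data rows only; headers are excluded because column
    names are said not to participate in scoring.
    """
    if not rows:
        return Counter()
    columns = zip(*rows)  # transpose rows into columns
    return Counter(tuple(sorted(str(v).strip() for v in col)) for col in columns)


def column_score(pred_rows: list, gold_rows: list, penalty: float = 0.1) -> float:
    """Recall over gold columns, minus a small penalty for extra predicted columns."""
    pred = column_signatures(pred_rows)
    gold = column_signatures(gold_rows)
    total_gold = sum(gold.values())
    if total_gold == 0:
        return 0.0
    matched = sum((pred & gold).values())  # multiset intersection
    recall = matched / total_gold
    total_pred = sum(pred.values())
    redundancy = penalty * (total_pred - matched) / max(total_pred, 1)
    return max(0.0, recall - redundancy)
```

Under these assumptions, a prediction with the right columns in a different column and row order still scores 1.0, while adding an unrelated third column lowers the score slightly rather than zeroing it.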

For developers, the lesson is direct. A data agent is not "let the LLM read a CSV and answer." The official environment requires file discovery, schema inference, SQL or Python execution, document grounding, numeric rounding, partial-output persistence, failure recovery, and logs. It is not just an app wrapped around a model API. It is a system that assembles a data pipeline inside a restricted runtime. Enterprise teams building internal analysis agents face the same constraints: data varies, permissions are limited, execution time is finite, and the final output must be checkable.

The fixed Qwen3.5-35B-A3B evaluation model is also worth reading carefully. Participants can use any model during development, but submitted code must not hardcode a private endpoint. It has to read the environment variables injected by the evaluation system and automatically switch to the organizer's model. That choice reduces model-selection competition and pushes attention toward agent architecture. It also resembles many enterprise deployments. Developers often cannot use whichever model they prefer in production; a central platform provides approved endpoints, keys, audit policy, and cost controls.
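
A minimal sketch of that switch, assuming the endpoint arrives in `MODEL_API_URL` as the rules describe: the credential variable names (`MODEL_API_KEY`, `DEV_MODEL_API_KEY`), the `ModelConfig` type, and the development fallback are hypothetical and would need to be checked against the official starter materials.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    url: str
    api_key: Optional[str]
    source: str  # "injected" or "development"


def load_model_config(
    env: Mapping[str, str] = os.environ,
    dev_url: Optional[str] = None,
) -> ModelConfig:
    """Prefer the injected endpoint; use a development endpoint only if none is set."""
    url = env.get("MODEL_API_URL", "").strip()
    if url:
        # MODEL_API_KEY is an assumed name for the injected credential.
        return ModelConfig(url, env.get("MODEL_API_KEY"), "injected")
    if dev_url:
        return ModelConfig(dev_url, env.get("DEV_MODEL_API_KEY"), "development")
    raise RuntimeError("MODEL_API_URL is not set and no development endpoint was given")
```

Keeping the lookup in one function means no other module needs to know which provider is behind the URL, and the injected value always wins over any local default.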

There is a limit to that design. A prompt and tool loop tuned locally with Claude, GPT, Gemini, DeepSeek, or a different Qwen-family model may not behave the same way under Qwen3.5-35B-A3B in the official environment. Models differ in tool-use habits, JSON consistency, long-document handling, and numerical mistakes. The Phase 1 score should therefore be read as "system performance under a unified model and constrained Docker environment," not as "the best possible data-agent performance with any frontier model."

The Creative subtrack points at a second axis of competition. In Phase 2, the Leaderboard subtrack uses a harder benchmark and new modalities such as data images and data videos. The Creative subtrack looks for mature, usable, interface-friendly data-agent systems and transparent decision processes. That split maps onto the product market. One side measures hidden-benchmark accuracy. The other asks whether an analyst can use the system, review its process, understand its failures, and recover from mistakes. In enterprise adoption, that second axis can matter as much as the score.

Public community reaction is still limited. A search for this article found no large independent discussion on Hacker News or Reddit. Results mainly surfaced a Hugging Face forum call for the competition, a POSTECH HAIV lab internship notice, and HKUST Guangzhou promotion on LinkedIn. That makes the event less a mainstream AI-news spectacle and more a live experiment inside the data mining, database, and agent-research communities.

The problem it raises is broader than that community. In 2025 and 2026, the AI-agent conversation expanded through coding agents, browser agents, MCP connectors, and workflow automation. Data-analysis agents receive messier inputs than many of those systems. Schemas are incomplete, CSV files are inconsistent, documents mix instructions with exceptions, and outputs are verified as numbers and tables. The agent's ability to narrow possible calculation errors matters more than the model's ability to write a confident paragraph.

Four practical questions follow for AI teams. First, when an agent processes many tasks, when does it persist partial results? The competition's advice to write prediction.csv before a timeout is the same advice production batch jobs need. Second, if a central platform injects a model endpoint, do the prompt, parser, retry logic, and tool loop survive a model swap? Third, does the Docker or sandbox layer actually block external internet and unauthorized model calls? Fourth, can evaluation logs distinguish malicious behavior, accidental rule violations, ordinary crashes, and plain timeouts?
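
The fourth question can be made concrete with a structured run record. The Python sketch below is one hypothetical shape for such a log: the field names, the outcome labels, and the rule that any policy signal goes to human review are assumptions, since telling malicious intent from an accidental violation probably cannot be automated from these signals alone.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunRecord:
    task_id: str
    exit_code: Optional[int]      # None when the run was killed for a timeout
    timed_out: bool
    blocked_network_calls: int    # hypothetical counter from the sandbox layer
    input_modified: bool          # hypothetical integrity check on the inputs


def classify(rec: RunRecord) -> str:
    """Separate policy signals from ordinary failures; order matters."""
    if rec.input_modified or rec.blocked_network_calls > 0:
        return "policy_review"    # a human decides: malicious or accidental
    if rec.timed_out:
        return "timeout"
    if rec.exit_code != 0:
        return "crash"
    return "ok"


def to_log_line(rec: RunRecord) -> str:
    """Serialize the record plus its outcome as one JSON line."""
    return json.dumps({**asdict(rec), "outcome": classify(rec)}, sort_keys=True)
```

Checking policy signals before timeouts and crashes keeps a run that both probed the network and then crashed from being filed as an ordinary failure.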

The delay should not be treated as a small scheduling incident. It is the cost of making agent evaluation realistic. Running Docker images from more than 700 teams requires a compute queue, protection against hidden-task leakage, container controls, enforcement against model bypasses, and a way to audit suspicious behavior. The same requirements will show up in internal agent-evaluation platforms, vendor benchmarks, and procurement proofs of concept.

KDD Cup 2026 Data Agents asks which systems can analyze on behalf of a person. The late-May adjustment asks the operational version of the same question: when hundreds of teams submit those systems, how do organizers verify which model was used, which files were read, which network calls were attempted, and which outputs were actually produced? The next edge in data-agent evaluation may be less about the length of a reasoning trace and more about leaving a verifiable artifact inside a constrained runtime.

The final winners are scheduled to be announced at KDD 2026 on August 9. Before that, one thing is already visible. Data agents have become an evaluation-infrastructure problem, not only a research problem. By fixing the model, requiring Docker, splitting A-board and B-board tasks, and tightening security review, the competition previews the control list that real agent systems will need in production. Developers building data-analysis agents now have to design the prompt and model path together with containers, logs, endpoint injection, partial output, and prohibited-behavior detection.