First GPAI Compliance Cycle: What Providers Filed
Three weeks past the August 2 deadline. The major GPAI providers have published their training-data summaries and copyright policies. The systemic-risk providers have made their initial filings to the AI Office. We now have a meaningful sample of what compliance looks like in practice. The story is more interesting than it might appear from the headlines.
Training-data summaries: convergence on a low ceiling
Read side by side, the published summaries show that providers have converged on something close to the Commission template's minimum aggregation level. For most major frontier models, what is now public is a list of dataset categories (e.g., "publicly available web content," "licensed publishers," "code repositories," "user-generated content"), source-type rollups (e.g., "approximately 30% web crawl," "approximately 15% licensed text"), and high-level filtering descriptions.
What is conspicuously not in the summaries:
- Specific dataset names beyond Common Crawl, Wikipedia, and a small handful of datasets that the providers have already publicly disclosed elsewhere.
- Quantitative breakdowns by language, domain, or content type beyond top-line percentages.
- Identification of copyrighted-content sources beyond the categorical level (e.g., "books," "news articles"); no summary names sources nominally (e.g., specific publishers).
- Information about the technical filtering pipelines used to remove rights-reserved or otherwise excluded content.
This is the floor of compliance, not the ceiling, and the Commission has signaled that it will engage on whether the floor is high enough. The Commission's first information request under Article 91, sent August 14 to three major providers, asks for additional detail on (a) the specific datasets used in pretraining, (b) the rightsholder opt-out implementation details, and (c) the volumes of content from each major source category. We will see how providers respond.
For practitioners advising downstream parties: read the summaries skeptically. They satisfy the regulatory obligation but generally do not provide enough information to support a downstream party's own due diligence on training-data risk. If your client needs more detail for litigation due diligence or for its own copyright posture, it will need to negotiate that bilaterally with the provider.
Copyright policies: the opt-out implementation question
The copyright policies under Article 53(1)(c) have been the more interesting filings. Several patterns:
- Robots.txt as the floor. All providers honor robots.txt-based crawler directives. Most have provider-specific user-agent strings (OpenAI's GPTBot, Google's Google-Extended token for Gemini training, etc.) and represent that they respect rightsholders' explicit opt-outs.
- The TDM Reservation Protocol. Several providers have committed to honoring the draft TDM Reservation Protocol, which provides a more structured way for rightsholders to express opt-outs. Adoption is uneven: the major search-and-AI players are committed; smaller providers vary.
- Licensed content carve-outs. Providers with significant licensed-content programs (Microsoft, Google, OpenAI's publisher deals) have separate sections describing licensed content and the negotiated terms. These are partially redacted in the public versions.
- Retroactive opt-outs. Most providers explicitly do not commit to removing content from already-trained models when rightsholders subsequently opt out. The argument is that doing so would require retraining at prohibitive cost. The legal sufficiency of this position under Article 4(3) of the CDSM Directive will be tested.
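The two opt-out mechanisms above are mechanically simple to check. A minimal sketch in Python: the robots.txt rules and the /.well-known/tdmrep.json entries below are invented for illustration, not any provider's or publisher's actual policy, and the TDMRep structure reflects the draft protocol's published shape.

```python
import json
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks OpenAI's GPTBot site-wide,
# blocks Google-Extended only under /archive/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /archive/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))
print(rp.can_fetch("Google-Extended", "https://example.com/article"))
print(rp.can_fetch("Google-Extended", "https://example.com/archive/x"))

# Illustrative /.well-known/tdmrep.json in the draft TDM Reservation
# Protocol's shape: rights are reserved for paths matching "location".
TDMREP = json.loads("""[
  {"location": "/news/", "tdm-reservation": 1,
   "tdm-policy": "https://example.com/tdm-policy.json"}
]""")

def tdm_reserved(path: str) -> bool:
    """True if any tdmrep entry reserves TDM rights for this path."""
    return any(
        path.startswith(entry["location"]) and entry.get("tdm-reservation") == 1
        for entry in TDMREP
    )

print(tdm_reserved("/news/story-1"))
print(tdm_reserved("/blog/post-1"))
```

The asymmetry the bullets describe is visible even here: both mechanisms only gate future crawling; neither expresses anything about content already ingested into a trained model.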
The "no retroactive removal" position is the legal pressure point. If the AI Office takes the view that Article 4(3) requires that opt-outs prevent existing models from continuing to be deployed (rather than just preventing new training), the consequences for current models are dramatic. The Commission has been silent so far. Expect this to be a major 2026 enforcement question.
Systemic-risk filings: what we can see
Article 55 obligations have a different disclosure posture. The model evaluations, risk assessments, and incident-reporting submissions are made to the AI Office and are not generally public. What is public:
- Code of Practice signatures and provider commitments to specific procedural steps.
- Some providers have voluntarily published their pre-deployment evaluation results and "system cards" or equivalent technical reports. Anthropic, Google DeepMind, and OpenAI have all done this for at least some models. Meta and several others have not.
- Public-facing safety frameworks, often in the form of "responsible scaling policies" or equivalent. These have proliferated in the post-Seoul period and are now mostly stable as a class of document.
The non-public submissions are where the meaningful compliance evaluation happens. The AI Office has substantially expanded its model-evaluation capacity, though, as I noted in May, headcount remains tight. Several technical contractors have been engaged to support evaluations. Substantive review of these submissions will take many months.
Ten provider trends I noticed reading across the filings
- Documentation quality varies enormously by provider, more than I expected. Top-tier providers have publication-quality documentation packages; bottom-tier providers have placeholder text in places.
- Providers with significant U.S. plaintiff-side copyright exposure (i.e., named defendants in NYT v. OpenAI and similar cases) are visibly more conservative in their public disclosures than those without.
- Open-weights providers (Mistral, Meta for Llama) have additional documentation about downstream usability that is genuinely useful.
- The downstream-provider documentation packages are uniformly better than the public-facing summaries. The asymmetry suggests providers are treating downstream parties as the actual audience.
- Energy and compute reporting is uneven. The Commission template asks for it; some providers report meaningfully and others do not.
- Several smaller providers have outsourced the documentation work to specialist consultancies. The output is consistent in shape across multiple providers, suggesting concentrated drafting.
- Multilingual disclosure has been a problem. Several providers have published EU-targeted documentation only in English; member-state authorities have signaled this is acceptable for now but should not become the long-term norm.
- Provider responses to the Commission's August 14 information request will be published in some form by September. Watch this space.
- Cybersecurity attestations under Article 55(1)(d) are mostly cross-references to existing ISO 27001 certifications, which is acceptable but probably will not remain so as the AI Office develops more AI-specific cybersecurity expectations.
- Incident-reporting infrastructure has been the most underdeveloped area. The Code of Practice's serious-incident definition is precise but operationalizing it requires monitoring and triage workflows that many providers have not built. Expect this to be the most common gap when the AI Office begins detailed reviews.
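To make the incident-reporting gap concrete: operationalizing a serious-incident definition means, at minimum, a triage rule that routes each event to logging, human review, or a regulator-notification track. A toy sketch in Python; the fields, categories, and thresholds here are invented for illustration and are not the Code of Practice's actual serious-incident definition.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ROUTINE = "routine"        # internal log only
    ESCALATE = "escalate"      # human review queue
    REPORTABLE = "reportable"  # candidate for regulator notification

@dataclass
class Incident:
    description: str
    harm_occurred: bool        # actual harm, not a near miss
    safeguard_bypassed: bool   # a deployed mitigation failed
    widespread: bool           # affects many users or deployments

def triage(incident: Incident) -> Severity:
    """Toy triage rule with invented thresholds."""
    if incident.harm_occurred and (incident.safeguard_bypassed or incident.widespread):
        return Severity.REPORTABLE
    if incident.harm_occurred or incident.safeguard_bypassed:
        return Severity.ESCALATE
    return Severity.ROUTINE

print(triage(Incident("jailbreak produced harmful output at scale",
                      harm_occurred=True, safeguard_bypassed=True,
                      widespread=True)).value)
```

Even this toy version shows why the gap exists: the rule is trivial, but the monitoring that populates those boolean fields reliably, across every deployment surface, is the infrastructure many providers have not built.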
What's next
The first regulatory dialogues — formal inquiries by the AI Office under Article 91 — are running in parallel with the August Commission letter. Quiet engagement will dominate the rest of 2025. Public enforcement decisions are unlikely before Q1 2026 except in cases of egregious non-compliance.
For practitioners: review the published summaries from the providers your clients use. The disclosure floor that has emerged is the floor for now, but it is unlikely to remain the long-term equilibrium. Plan compliance posture for a higher disclosure expectation in 2026 and beyond.