Issues of Domain-Based Approaches

Robots.txt

Robots.txt has emerged as a practical way for rightsholders to express their preferences for web-published content: it is easy to implement, widely used, and a recognised standard.
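As an illustration, a rightsholder with control over a domain can express such a preference in a robots.txt file at the site root. The sketch below uses example.com as a placeholder domain; GPTBot and CCBot are real user-agent tokens published by OpenAI and Common Crawl respectively, shown here only as examples of AI-related crawlers that state they honour robots.txt:

```
# robots.txt served at https://example.com/robots.txt
# Reserve rights against AI-training crawlers that honour their published tokens

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that these rules bind only crawlers that choose to respect them, and only for content served from this particular domain.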

However, for stakeholders in many sectors, putting a rights reservation in place via robots.txt is simply not an option. Location/domain-based approaches are ineffective for the following reasons:

  1. Most professional creators and rightsholders do not publish their content online;

  2. Copyrighted content is republished on the Internet without authorisation;

  3. Content is shared on websites rightsholders do not control, such as social media platforms or licensing partners' websites.

Robots.txt is insufficient as a primary opt-out mechanism for protecting creators’ rights in AI training. It applies only in the limited circumstances where rightsholders have direct domain control, cannot capture content published without authorisation, and fails to address downstream licensing and use of creative works. AI regulations must adopt more sophisticated, state-of-the-art measures, such as asset-level rights reservation, to ensure comprehensive protection for creators and rightsholders.

Most professional creators and rightsholders do not publish their content online

Many creators in industries such as music, film, book publishing, and audiobooks do not distribute or publish their content directly online. Yet their works may already be used in AI training datasets through other means (e.g., data bundles, unauthorised uploads). Relying solely on robots.txt disregards creators whose content is included in datasets without ever being published on a website they control.

A more robust framework would require AI developers to verify whether rights reservations are put in place at the asset or work level, independent of robots.txt implementation.

Copyrighted content is republished on the Internet without authorisation

Content frequently appears online without rightsholders' permission due to piracy, negligence, or ignorance. While "Sub-Measure 4.5. No crawling of piracy websites" is a positive step, it fails to address cases of unauthorised republication on legitimate platforms. For example, copyrighted images may circulate on forums, blogs, or social media without any indication of their original source.

Robots.txt is ineffective here because those sharing the content may not be aware of the original rights reservation or may actively disregard it.

Content is shared on websites rightsholders do not control, such as social media platforms

Robots.txt requires control over the hosting domain, making it ineffective for content distributed on third-party platforms, such as social media or licensing partners’ websites, where creators cannot implement their individual robots.txt settings.

For instance, a photographer's image licensed to a news outlet may be shared on social media, where the original robots.txt settings are neither implemented nor enforceable – neither by the original rightsholder nor the licensee. Even when creators use robots.txt on their own websites, they cannot ensure downstream compliance. Licensing terms are often dictated by the licensee further along the distribution chain, and economic constraints may prevent the licensor from enforcing robots.txt settings effectively.
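The domain scoping described above can be made concrete with Python's standard-library robots.txt parser. This is a minimal sketch: "ExampleAIBot" and the domains are hypothetical, and `parse()` is fed raw robots.txt lines directly so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# The creator's own site carries a rights reservation in its robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: ExampleAIBot",  # hypothetical AI-training crawler
    "Disallow: /",               # reservation covering the whole site
])

# The reservation holds for URLs on the domain this file governs.
allowed_on_origin = rp.can_fetch(
    "ExampleAIBot", "https://creator.example/photo.jpg"
)  # False: the crawler is asked not to fetch anything here

# But a compliant crawler consults each host's own robots.txt. A copy of the
# same photo rehosted on a third-party platform is evaluated against that
# platform's file, where the creator's reservation never appears, so the
# default answer is "allowed".
platform_rules = RobotFileParser()
platform_rules.parse([])  # an empty robots.txt: no reservations at all
allowed_on_platform = platform_rules.can_fetch(
    "ExampleAIBot", "https://platform.example/copy-of-photo.jpg"
)  # True
```

The point of the sketch is that the reservation travels with the domain, not with the work: the moment the photo leaves the creator's site, the robots.txt signal is lost.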
