Content Aggregation Technology (CAT)

This document analyses the respective merits of different approaches to content aggregation on the Web. It builds atop this comparative analysis to outline an architecture and requirements for a technology to support content aggregation on the Web (for search engines, social media, and syndication services), in a manner that follows the Priority of Constituencies, notably in privileging user experience, user privacy, as well as simplicity and control for authors (also often known as "publishers") [[html-design-principles]].

Content aggregation involves every constituency of the Web. A core goal of this document is to provide a backbone to elaborate content aggregation solutions that prioritise consensus over the unilateral imposition of technology through excessive market power.

Brief Overview of Existing Solutions

The initial aggregation format is simply HTML [[HTML]], which with its hypertext linking capabilities has enabled the curation of resources since its inception. Widespread interest in allowing publishers to "push" content to users led to a codification of publisher/reader interactions in open protocols such as the Channel Definition Format [[CDF-SUBM]], the RDF Site Summary (later re-branded as Rich Site Summary and Really Simple Syndication) [[RSS-MEDIA-TYPE]], and Atom [[rfc4287]]. These formats combine machine-readable metadata about content, which makes it easy to present lists of resources in a user-friendly manner, with links to the content and sometimes the possibility to include the entirety of the source as HTML (depending on publisher choice).

In more recent times, a flurry of aggregator-dominated formats have emerged. They feature varying degrees of sophistication but have in common that they require publishers to recast their content into a language covering a subset of the Web platform. These proprietary languages range from limited, as is the case with Accelerated Mobile Pages defined by Google (also used by Microsoft Bing and Twitter) [[AMP]], to severely limited, as with the Apple News Format [[APPLE-NEWS-FORMAT]] or Facebook Instant Articles [[FBIA]]. These technologies have slightly different areas of focus, which can be bundling for portals (AMP or [[MIP]]), for apps (ANF), or for social media (FBIA), but from the perspective of users and publishers they are more alike than different, as a set of proprietary formats with largely comparable properties.

A more recent proposal, Web Bundles [[WEB-BUNDLES]], has been discussed by the Web and Internet communities [[rfc8752]]. It attempts to make content distribution look like content linking while cutting publishers out of the distribution of their own content. While this does lead to some improvements over other options, and on several axes it is the least hostile of the proprietary options (as discussed below), this proposal has yet to garner any consensus.

Assessment Methodology

In comparing solutions for aggregation we rely on the following properties:

User Experience: This is measured based on whether the proposal keeps the full capabilities of the Web platform available or if it degrades user experience by enforcing a subset of the platform. Values on this dimension can be optimal just in case the full power of the Web platform is available in service of the user's experience, or degraded if the user experience is limited to a subset of the platform.
User Privacy: For simplicity, we treat violations of privacy as any information that leaks outside of the direct relationship between the user on one side, and the first party and its service providers on the other (as defined in [[tracking-dnt]]). Values can be optimal just in case user privacy is not worsened by the aggregation architecture (endemic problems of privacy on the Web notwithstanding), or degraded if the architecture specifically makes privacy worse.
Content Preview: This property indicates whether the aggregation architecture inherently lends itself to providing a rich preview for the content or if previews have to be built in an ad hoc manner. The values are, respectively, supported or ad hoc.
Preloading: Whether effective preloading of the content, intended to speed up click-through, can easily be supported. Values: supported and not supported.
Publisher Level of Effort: Several of the existing aggregation solutions function by pressuring publishers to take on the work of optimising aggregator platforms such that publishers are handicapped if they don't participate in the self-preferencing format. The philosophy of this document is to encourage the development of an ecosystem that pushes towards the mutualistic end of the mutualistic-parasitic axis. Accordingly, the values along this dimension can be either reasonable when the publisher has simply to produce the content alongside possibly some moderate effort to render it machine-readable according to open standards, or exploitative when the architecture levies work from publishers that is intended solely or primarily to optimise the aggregator's platform.
Revenue Model: An inescapable truth is that publishers and aggregators both need to generate revenue to continue their operations. Revenue is typically generated directly from readers/users (subscriptions) or indirectly (advertising). Privacy and user experience concerns can be leveraged to create limitations on the capabilities of the formats, which can impact which parties are able to monetize the published/aggregated content. Values on this dimension can be open, meaning the creator of the content has full control over how revenue is derived from the work, or limited meaning the creator has to work within artificial constraints.

Potential Architectures

There is substantial historical precedent in content aggregation on the Web. However, it does not seem necessary to produce a detailed investigation of each past technology. We can instead outline three major architectures: the traditional one on which the Web was built, the one that has come to dominate in recent years and which the standards community wishes to improve upon, and one grounded in a consensus approach to Web architecture that addresses the limitations of the traditional model. Experience with historical examples can serve to illustrate various aspects of each architecture.

Links

Links are the original embodiment of content aggregation and remain the gold standard for content curation on the Web. Any viable solution should improve upon linking.

Linking enables a fundamental architectural property of the Web which is the separation of discoverability from distribution. This powerful feature is grounded in URLs, which are the defining architectural element of the Web and what initially set it apart from other hypertext systems (of which there were many).

According to our methodology, links have the following properties:

Assessment of the *Link* architecture
Dimension	Value	Notes
User Experience	`optimal`	The full power of the Web platform is available; in fact this is the expectation of how the Web platform is intended to work.
User Privacy	`optimal`	The user has distinct and clearly delineated relationships with separate first parties being the aggregator and the publisher (especially with judicious application of a `Referrer-Policy` [[referrer-policy]]).
Content Preview	`ad hoc`	While some de facto systems exist to provide metadata that enables limited content previewing (primarily in the form of "cards" intended for social media usage), there is currently no evident standard that could enable content preview simply from linking.
Preloading	`not supported`	Any manner of preloading would be entirely ad hoc.
Publisher LoE	`reasonable`	All that the publisher needs to do is publish.
Revenue Model	`open`	Publishers have full control over the advertising and subscription mechanisms for their content. The onus is fully on publishers to maintain unique infrastructure for their revenue model.

Aggregator-Dominated Bundling

In recent years, as the Internet became increasingly concentrated around a small number of extremely powerful actors, the Web's dominant content aggregation architecture has shifted towards proprietary formats imposed by aggregators in furtherance of their own preferences. These share a set of broad characteristics:

They attempt to limit what the Web platform can do. This helps aggregators treat content as a fungible commodity rather than as differentiated production grounded in craft. They also privilege aggregators (and runaway scale) by favouring legibility of the content over expressivity and local knowledge of one's topic and audience [[SEEING-LIKE-A-STATE]].
They tend to optimise characteristics that define the aggregator/user relationship (particularly performance or uniformity) over those that prevail in the publisher/user relationship (experience or trust). Put differently, they constitute a faster way to deliver a degraded experience.
They make publishers do work to optimise the aggregator's offering as a levy that must be discharged to maintain traffic or revenue. Little value is derived from the aggregation architecture itself, and in fact "several online publishers indicated that if it weren't for the privileged position in the Google Search carousel given to AMP content, they would not publish in that format." [[rfc8752]]

Assessment of the *Aggregator-Dominated* architecture
Dimension	Value	Notes
User Experience	`degraded`	Every technology in this architecture class degrades the user experience by limiting the capabilities of the Web platform to those deemed useful and legible by the aggregator. Web Bundles [[WEB-BUNDLES]] attempts principally to solve this issue.
User Privacy	`degraded`	Technologies in this group try to shift relationships that users expect to have with publishers towards relationships with aggregators instead, sometimes even providing deceptive indications regarding the URL from which the content is served, leading data to leak from a publisher context to an aggregator context.
Content Preview	`supported`	The legibility enables higher-quality preview of content. Note: in some cases, essentially the full content is loaded into a carousel but this use case is not considered here because overriding another party's interaction modalities is considered excessively hostile compared to the decisively mutualistic and consensual philosophy of this document. [[CAROUSEL]]
Preloading	`supported`	The blurring of discoverability and distribution directly delivers the ability to preload content.
Publisher LoE	`exploitative`	Since these formats are entirely ad hoc and primarily focused on aggregator use cases, publishers have to work to optimise aggregators' platforms.
Revenue Model	`limited`	Publishers often have to work within the constraints on advertising instituted by aggregators, if that revenue model is available at all. Support for subscription or login-controlled content is either not present, requires extra optimization on behalf of publishers, or mandates that publishers use the aggregator's own subscription system.

Cooperative Syndication

Both the Links and the Aggregator-Dominated approaches have different shortcomings. The approach advocated by the Cooperative Syndication aggregation architecture stems from multiple principles:

It should be the product of consensus amongst the various constituencies of the Web so as to reflect a healthy balance of power in the global digital ecosystem and not imposed unilaterally by a single actor simply by dint of overwhelming market power. This is essentially a return to Open Stand principles in lieu of monopoly power. [[OPEN-STAND]]
It should maintain the architectural values supported by URLs in general, and the discoverability/distribution distinction in particular.
It should proceed through evolution, not revolution: do not make massive architectural changes to address simple problems, but instead try to support existing content, to pave the cowpaths, and not to reinvent the wheel. [[html-design-principles]]

Requirements for the *Cooperative Syndication* architecture
Dimension	Value	Requirement
User Experience	`optimal`	Content is simply captured using the Web Platform, therefore its full power is available.
User Privacy	`optimal`	Content is pointed to using links and distributed by publishers. Therefore, the user has distinct and clearly delineated relationships with separate first parties being the aggregator and the publisher.
Content Preview	`supported`	An existing ecosystem of metadata, principally stemming from social media "card" languages as well as from schema.org [[schema-org]] exists and is waiting for nothing more than some further formalisation and extensions to render it capable of more powerful previewing. The following section provides some high level examples of the direction in which these existing capabilities could be developed.
Preloading	`supported`	This is a relatively more difficult feature to develop; however, it can usefully be approached in a generic manner. Prefetching using stateless requests and IP privacy can already go a long way towards addressing the issue. Publishers could further rely on bundling (unsigned, simply served from their domain) to enable more effective prefetching.
Publisher LoE	`reasonable`	Under this architecture, the bulk of work that publishers need to carry out is to enhance their existing content with better machine-readable metadata, and possibly with some bundling that they could themselves use on their own sites. Overall, this presents a reasonable increase compared to linking and a substantial improvement over aggregator-dominated bundling.
Revenue Model	`open`	Publishers would be free to continue their existing revenue models based on users viewing the full content, supported by unencumbered advertising or subscription models. A collaborative hand-off between aggregators and publishers could be possible using entitlement tokens.
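
The three tables above can be compared mechanically. The following is a minimal sketch that transcribes their values and lists, for each architecture, the dimensions on which it misses the preferable value. The names `PREFERRED`, `ASSESSMENTS`, and `shortcomings` are illustrative only, and the Cooperative Syndication entries are requirements (targets) rather than observed properties.

```python
# Values transcribed from the three tables in this document.
DIMENSIONS = [
    "User Experience", "User Privacy", "Content Preview",
    "Preloading", "Publisher LoE", "Revenue Model",
]

# The preferable value on each dimension, per the Assessment Methodology.
PREFERRED = {
    "User Experience": "optimal",
    "User Privacy": "optimal",
    "Content Preview": "supported",
    "Preloading": "supported",
    "Publisher LoE": "reasonable",
    "Revenue Model": "open",
}

ASSESSMENTS = {
    "Links": {
        "User Experience": "optimal", "User Privacy": "optimal",
        "Content Preview": "ad hoc", "Preloading": "not supported",
        "Publisher LoE": "reasonable", "Revenue Model": "open",
    },
    "Aggregator-Dominated": {
        "User Experience": "degraded", "User Privacy": "degraded",
        "Content Preview": "supported", "Preloading": "supported",
        "Publisher LoE": "exploitative", "Revenue Model": "limited",
    },
    # Requirements (targets), not an assessment of a deployed technology.
    "Cooperative Syndication": dict(PREFERRED),
}


def shortcomings(architecture):
    """Return the dimensions on which an architecture misses the preferred value."""
    values = ASSESSMENTS[architecture]
    return [d for d in DIMENSIONS if values[d] != PREFERRED[d]]


if __name__ == "__main__":
    for name in ASSESSMENTS:
        print(name, "->", shortcomings(name))
```

Read this way, Links fall short only on Content Preview and Preloading, which are precisely the two areas the requirements below concentrate on.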

Requirements for Cooperative Syndication

Further discussion of the community's preferred architecture is warranted before too much work is invested into outlining detailed requirements. Some aspects of the Cooperative architecture are nevertheless worth calling out early, even if only in outline at this stage.

Standardising Card Preview

Multiple ways exist on today's Web to provide preview capabilities through "cards", used on social networks such as Facebook and Twitter, and reused by other parts of the ecosystem (e.g. Slack or Keybase). Additionally, the schema.org language [[schema-org]], incarnated through multiple syntaxes, is widely used in content.

These machine-readable (though at times presentation-oriented) features are enough to support user-friendly content preview and can be extended further.

Other implementations might borrow from or build upon Google's portal proposal, allowing publishers to control the presentation of a responsive preview that could conform to the needs of aggregators, while maintaining privacy and performance.
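
As an illustration of how far existing card metadata already goes, the following is a minimal sketch of an aggregator building a preview from a publisher's page. It assumes the publisher exposes Open Graph `<meta property="og:...">` elements (one of the de facto card languages); the class `CardMetadataParser`, the function `build_card`, and the shape of the returned card are hypothetical and not part of any standard.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                # Keep the first value seen for each property.
                self.og.setdefault(prop[3:], attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_card(html, url):
    """Build a preview card, falling back to <title> and the linked URL."""
    parser = CardMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "url": parser.og.get("url", url),
        "title": parser.og.get("title") or parser.title.strip(),
        "description": parser.og.get("description", ""),
        "image": parser.og.get("image"),
    }
```

The fallbacks matter: a page with no card metadata still yields a usable (if plain) preview, so publishers are never required to do extra work merely to be linked to.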

Improved Privacy-Preserving Prefetch

Rather than relying on blurring the distinction between discoverability and distribution as today's aggregator-dominated architecture tends to (and [[WEB-BUNDLES]] further), the Web would be better-served by improvements to standard prefetching capabilities.

Notably, publishers would be incentivised to collaborate with prefetching if it made their own distribution faster, rather than having their content forklifted to be served by a third party while misleading the user as to where the content was loaded from [[WEB-BUNDLES]]. This could include providing bundles (unsigned, served directly), if that indeed proves faster than alternative HTTP/2 mechanisms.

It should be noted that loading someone else's content without fetching from their site is explicitly not a use case of this document. A publisher's site should be loaded either directly from them or from one of their contracted delegates (such as a CDN in their employ). The explicitly stated goal of [[WEB-BUNDLES]] is to load a publisher's content while excluding the publisher from the transaction ("This attempt is instead motivated by avoiding a connection to the origin server at all." [[WEB-BUNDLES-USE-CASES]]). This makes it a solution that is anything but cooperative.

Additionally, user agents should be encouraged to develop support for IP privacy in service of their users' reasonable expectations of privacy. Such a feature, while generic, would also help support privacy-preserving prefetches.
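
To make the notion of a stateless prefetch concrete, the following is a minimal sketch of the header handling involved: anything that carries user state or reveals the referring context is dropped before the speculative request is sent to the publisher. The function name and the exact set of stripped headers are assumptions; in practice this logic would live in the user agent, and the `Sec-Purpose: prefetch` hint follows the Fetch Standard's marking of prefetch requests, though whether it is the right signal here is also an assumption.

```python
# Request headers that carry user state or identify the referring context.
STATEFUL_HEADERS = {"cookie", "authorization", "referer"}


def stateless_prefetch_headers(headers):
    """Return a copy of `headers` suitable for a credential-free prefetch."""
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() not in STATEFUL_HEADERS
    }
    # Mark the request as speculative so the publisher can treat it as such
    # (for instance by not counting it as a page view).
    cleaned["Sec-Purpose"] = "prefetch"
    return cleaned
```

Header stripping alone does not hide the user's network address from the publisher, which is why IP privacy is called out separately above.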

Transmissible Entitlement Tokens

Some systems exist that enable forms of profit sharing between aggregators and publishers. These can follow multiple models such as subscriptions (Subscribe with Google), subsidising publishers to make some of their content available (Facebook News Tab), or sharing a tiny fraction of the revenue from a content aggregator's bundle (Apple News Plus).

The ecosystem would globally benefit from enabling aggregators and publishers to share entitlement tokens that would prove that a given user is entitled to specific access. This would enable business models in which publishers and aggregators work together on generating revenue according to their respective participation.

Implementation-wise, this could take the form of a non-fungible token or of some sort of nonce provided as part of the aggregator's linking to the publisher (possibly mediated by the user agent).
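
One possible shape for such a token, offered purely as a sketch, is a short-lived signed claim appended to the link. The example below swaps in an HMAC over a JSON claim, using a key that the aggregator and publisher are assumed to have agreed upon out of band; the claim fields, function names, and token layout are all hypothetical, and this document does not prescribe this mechanism.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def _b64(data):
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_token(key, user, resource, ttl=300, now=None):
    """Aggregator side: sign a short-lived claim that `user` may access `resource`."""
    issued = int(time.time() if now is None else now)
    claim = {
        "user": user,
        "resource": resource,
        "expires": issued + ttl,
        "nonce": secrets.token_hex(8),  # makes every token unique
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(signature)


def verify_token(key, token, resource, now=None):
    """Publisher side: return the claim if the token is valid, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # not signed with the shared key
    claim = json.loads(payload)
    current = time.time() if now is None else now
    if claim["resource"] != resource or claim["expires"] < current:
        return None  # wrong resource or expired
    return claim
```

A real deployment would additionally need the publisher to remember recently seen nonces to prevent replay, and would need to consider what the `user` field reveals; a pseudonymous, per-publisher identifier would be more in keeping with the privacy goals of this document.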

Machine-Readable Distribution Licenses

The terms under which aggregators use and reuse publisher content are often decided unilaterally (by the aggregator). This could be improved through the provision of machine-readable licenses that could for instance allow citation for the purpose of linking but not to provide direct answers, or could grant reuse rights for some pictures as part of a portal but not rights to mine the text to train digital assistants with.
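
A machine-readable licence of this kind could be as simple as a mapping from asset types to the uses granted for them. The sketch below is illustrative only: the vocabulary (`cite-for-linking`, `portal-display`, `direct-answer`, `model-training`) and the structure of `EXAMPLE_LICENSE` are hypothetical and do not correspond to any existing licence language.

```python
# Hypothetical licence vocabulary: each asset type maps to the uses it permits.
EXAMPLE_LICENSE = {
    "licensor": "https://publisher.example/",
    "grants": {
        "text": ["cite-for-linking"],
        "image": ["cite-for-linking", "portal-display"],
    },
}


def is_permitted(license_terms, asset_type, use):
    """Return True only if the licence explicitly grants `use` for `asset_type`."""
    return use in license_terms.get("grants", {}).get(asset_type, [])
```

For instance, `is_permitted(EXAMPLE_LICENSE, "text", "cite-for-linking")` is true while `is_permitted(EXAMPLE_LICENSE, "text", "model-training")` is false. The deny-by-default design means that any use not explicitly listed is refused, so new kinds of reuse require the publisher's consent rather than their vigilance.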

Addressing In-App Browsers

Some aggregators run applications that embed Web rendering components that they use instead of a user's chosen browser. This practice, known as "in-app browsers", is an antipattern that causes countless problems with users not being logged in to publisher sites and not having their browsing preferences respected.

This practice also unfairly attempts to keep users on the aggregator's property instead of letting users access the Web properly. Since the user's chosen agent is being substituted, the opportunity for privacy violations is rampant.

While banning this practice entirely may be beyond the reach of the standards process, it would be desirable for browser engines to support an in-app-busting HTTP header that would enable publishers to break out of aggregator apps that try to jail in their experience and that keep tracking user behaviour while users believe they are on another property.
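
On the publisher side, emitting such a header would be trivial. The following sketch wraps a WSGI application so that every response carries it. The header name `Open-In-User-Browser` and its value are invented for illustration; no such header is standardised or supported by any engine today.

```python
# Hypothetical header name and value; no such header is standardised today.
BUSTING_HEADER = ("Open-In-User-Browser", "required")


def with_in_app_busting(app):
    """Wrap a WSGI app so every response carries the (hypothetical) busting header."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append the header without disturbing those set by the application.
            return start_response(status, list(headers) + [BUSTING_HEADER], exc_info)
        return app(environ, patched_start_response)
    return wrapped
```

The substantive work would lie with browser engines, which would need to honour the header by handing the navigation over to the user's chosen browser.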
