This document analyses the respective merits of different approaches to content aggregation on the Web. It builds atop this comparative analysis to outline an architecture and requirements for a technology to support content aggregation on the Web (for search engines, social media, and syndication services), in a manner that follows the Priority of Constituencies, notably in privileging user experience, user privacy, as well as simplicity and control for authors (also often known as "publishers") [[html-design-principles]].
Content aggregation involves every constituency of the Web. A core goal of this document is to provide a backbone to elaborate content aggregation solutions that prioritise consensus over the unilateral imposition of technology through excessive market power.
"A defining characteristic of the Web is that it allows embedded references to other resources" [[webarch]]. As such, content aggregation — the curation of lists of content external to one's control — is as old as the Web itself and has gone through multiple embodiments over the decades. This historical perspective allows us to tease out best practices from ones that produce problematic outcomes.
One difficulty that should be called out is that it is challenging to enable aggregation without increasing centralisation, and with that the control of the centralising entity. The easy way out is to copy non-Web solutions and have content centralised with the aggregator. The Web already resembles a proprietary application store: an increasingly centralised source of discovery which is also the single source of revenue. This document is written with the assumption that competing with proprietary application stores by mimicking their architectural characteristics is not a desirable direction. Improvements upon the status quo should instead play to the Web's strengths and seek to increase decentralisation.
The initial aggregation format is simply HTML [[HTML]], which with its hypertext linking capabilities has enabled the curation of resources since its inception. Widespread interest in allowing publishers to "push" content to users led to a codification of publisher/reader interactions in open protocols such as the Channel Definition Format [[CDF-SUBM]], the RDF Site Summary (later re-branded as Rich Site Summary and Really Simple Syndication) [[RSS-MEDIA-TYPE]], and Atom [[rfc4287]]. These formats combine machine-readable metadata about content, which makes it easy to present lists of resources in a user-friendly manner, with links to the content and sometimes the option to include the entirety of the source as HTML (depending on publisher choice).
In more recent times, a flurry of aggregator-dominated formats has emerged. They feature varying degrees of sophistication but have in common that they require publishers to recast their content into a language covering a subset of the Web platform. These proprietary languages range from limited, as is the case with Accelerated Mobile Pages defined by Google (also used by Microsoft Bing and Twitter) [[AMP]], to severely limited, as with the Apple News Format [[APPLE-NEWS-FORMAT]] or Facebook Instant Articles [[FBIA]]. These technologies have slightly different areas of focus, which can be bundling for portals (AMP or [[MIP]]), for apps (ANF), or for social media (FBIA), but from the perspective of users and publishers they are more alike than different: a set of proprietary formats with largely comparable properties.
A more recent proposal, Web Bundles [[WEB-BUNDLES]], has been discussed by the Web and Internet communities [[rfc8752]]. It attempts to make content distribution look like content linking while cutting publishers out of the distribution of their own content. Although it does improve on other options, and on several axes is the least hostile of the proprietary options (as discussed below), this proposal has yet to garner any consensus.
In comparing solutions for aggregation we rely on the following properties:
There is substantial historical precedent in content aggregation on the Web. However, it does not seem necessary to produce a detailed investigation of each past technology. We can instead outline three major architectures: the traditional one on which the Web was built, the one that has come to dominate in recent years and which the standards community wishes to improve upon, and one grounded in a consensus approach to Web architecture that addresses the limitations of the traditional model. Experience with historical examples can serve to illustrate various aspects of each architecture.
Links are the original embodiment of content aggregation and remain the gold standard for content curation on the Web. Any viable solution should improve upon linking.
Linking enables a fundamental architectural property of the Web which is the separation of discoverability from distribution. This powerful feature is grounded in URLs, which are the defining architectural element of the Web and what initially set it apart from other hypertext systems (of which there were many).
According to our methodology, links have the following properties:
|User Experience||optimal||The full power of the Web platform is available; in fact this is the expectation of how the Web platform is intended to work.|
The user has distinct and clearly delineated relationships with separate first parties being the aggregator and the publisher (especially with judicious application of a
|Content Preview||ad hoc||While some de facto systems exist to provide metadata that enables limited content previewing (primarily in the form of "cards" intended for social media usage), there is currently no evident standard that could enable content preview simply from linking.|
|Preloading||not supported||Any manner of preloading would be entirely ad hoc.|
|Publisher LoE||reasonable||All that the publisher needs to do is publish.|
|Revenue Model||open||Publishers have full control over the advertising and subscription mechanisms for their content. The onus is fully on publishers to maintain unique infrastructure for their revenue model.|
In recent years, as the Internet became increasingly concentrated around a small number of extremely powerful actors, the Web's dominant content aggregation architecture has shifted towards proprietary formats imposed by aggregators in furtherance of their own preferences. These share a set of broad characteristics:
|User Experience||degraded||Every technology in this architecture class degrades the user experience by limiting the capabilities of the Web platform to those deemed useful and legible by the aggregator. Web Bundles [[WEB-BUNDLES]] attempts principally to solve this issue.|
|User Privacy||degraded||Technologies in this group try to shift relationships that users expect to have with publishers towards relationships with aggregators instead, sometimes even providing deceptive indications regarding the URL from which the content is served, leading data to leak from a publisher context to an aggregator context.|
|Content Preview||supported||The legibility enables higher-quality preview of content. Note: in some cases, essentially the full content is loaded into a carousel but this use case is not considered here because overriding another party's interaction modalities is considered excessively hostile compared to the decisively mutualistic and consensual philosophy of this document. [[CAROUSEL]]|
|Preloading||supported||The blurring of discoverability and distribution directly delivers the ability to preload content.|
|Publisher LoE||exploitative||Since these formats are entirely ad hoc and primarily focused on aggregator use cases, publishers have to work to optimise aggregators' platforms.|
|Revenue Model||limited||Publishers often have to work within the constraints on advertising instituted by aggregators, if that revenue model is available at all. Support for subscription or login-controlled content is either not present, requires extra optimisation on behalf of publishers, or mandates that publishers use the aggregator's own subscription system.|
The Links and the Aggregator-Dominated approaches each have their own shortcomings. The approach advocated by the Cooperative Syndication aggregation architecture stems from multiple principles:
|User Experience||optimal||Content is simply captured using the Web Platform, therefore its full power is available.|
|User Privacy||optimal||Content is pointed to using links and distributed by publishers. Therefore, the user has distinct and clearly delineated relationships with separate first parties being the aggregator and the publisher.|
|Content Preview||supported||An existing ecosystem of metadata, principally stemming from social media "card" languages as well as from schema.org [[schema-org]] exists and is waiting for nothing more than some further formalisation and extensions to render it capable of more powerful previewing. The following section provides some high level examples of the direction in which these existing capabilities could be developed.|
|Preloading||supported||This is a relatively more difficult feature to develop, however it can usefully be approached in a generic manner. Prefetching using stateless requests and IP privacy can already go a long way towards addressing the issue. Publishers could further rely on bundling (unsigned, simply served from their domain) to enable more effective prefetching.|
|Publisher LoE||reasonable||Under this architecture, the bulk of work that publishers need to carry out is to enhance their existing content with better machine-readable metadata, and possibly with some bundling that they could themselves use on their own sites. Overall, this presents a reasonable increase compared to linking and a substantial improvement over aggregator-dominated bundling.|
|Revenue Model||open||Publishers would be free to continue their existing revenue models based on users viewing the full content, supported by unencumbered advertising or subscription models. A collaborative hand-off between aggregators and publishers could be possible using entitlement tokens.|
Further discussion of the community's preferred architecture is warranted before too much work is invested into outlining detailed requirements. Some aspects of the Cooperative architecture are nevertheless worth calling out early, even if only in outline at this stage.
Multiple ways exist on today's Web to provide preview capabilities through "cards", used on social networks such as Facebook and Twitter, and reused by other parts of the ecosystem (e.g. Slack or Keybase). Additionally, the schema.org language [[schema-org]], incarnated through multiple syntaxes, is widely used in content.
These machine-readable (though at times presentation-oriented) features are enough to support user-friendly content preview and can be extended further.
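As a rough illustration of how existing card metadata can already drive a preview, the following minimal sketch collects Open Graph and Twitter card `<meta>` values from a page and assembles a small preview record. The `build_preview` function, the shape of the record, the choice to prefer Open Graph values, and the sample page are all assumptions made for this example; they are not part of any proposal discussed here.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collects Open Graph and Twitter card <meta> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses property="og:..."; Twitter cards use name="twitter:...".
        key = attributes.get("property") or attributes.get("name")
        value = attributes.get("content")
        if key and value and key.startswith(("og:", "twitter:")):
            # Keep the first value seen for each key.
            self.metadata.setdefault(key, value)


def build_preview(html_text):
    """Returns a preview record, preferring Open Graph over Twitter values.

    Hypothetical helper: the record layout is an assumption of this sketch.
    """
    parser = CardMetadataParser()
    parser.feed(html_text)
    found = parser.metadata

    def pick(field):
        return found.get("og:" + field) or found.get("twitter:" + field)

    return {
        "title": pick("title"),
        "description": pick("description"),
        "image": pick("image"),
        "url": pick("url"),
    }


# Illustrative page; publisher.example is a placeholder domain.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="An Example Story">
<meta name="twitter:description" content="A short summary.">
<meta property="og:url" content="https://publisher.example/story">
</head><body></body></html>
"""

preview = build_preview(SAMPLE_PAGE)
```

Fields that the publisher did not provide come back as `None`, which leaves it to the aggregator to decide how to degrade the preview.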
Other implementations might borrow from or build upon Google's portal proposal, allowing publishers to control the presentation of a responsive preview that could conform to the needs of aggregators, while maintaining privacy and performance.
Rather than relying on blurring the distinction between discoverability and distribution as today's aggregator-dominated architecture tends to (and [[WEB-BUNDLES]] further), the Web would be better-served by improvements to standard prefetching capabilities.
Notably, publishers would be incentivised to collaborate with prefetching if it makes their own distribution faster, rather than having their content forklifted to be served by a third party while misleading the user as to where the content was loaded from [[WEB-BUNDLES]]. This could include providing bundles (unsigned, served directly), if that indeed proves faster than alternative HTTP/2 mechanisms.
It should be noted that loading someone else's content without fetching from their site is explicitly not a use case of this document. A publisher's site should be loaded either directly from them or from one of their contracted delegates, such as a CDN in their employ. The explicitly stated goal of [[WEB-BUNDLES]] is to load a publisher's content while excluding the publisher from the transaction ("This attempt is instead motivated by avoiding a connection to the origin server at all." [[WEB-BUNDLES-USE-CASES]]). This makes it a solution that is anything but cooperative.
Additionally, user agents should be encouraged to develop support for IP privacy in service of their users' reasonable expectations of privacy. Such a feature, while generic, would also help support privacy-preserving prefetches.
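One way to picture the "stateless" half of a privacy-preserving prefetch is as a filter over the headers a client would otherwise send. The sketch below is only an illustration under stated assumptions: the `stateless_prefetch_headers` helper and the list of headers to drop are hypothetical choices for this example, and marking the request with `Sec-Purpose: prefetch` is an illustrative decision rather than a requirement of this document. IP privacy is out of scope for the sketch, since it depends on how the user agent routes the request.

```python
# Headers that carry user state or identity; dropped for a stateless prefetch.
# Assumption: this list is illustrative, not exhaustive.
CREDENTIAL_HEADERS = {"cookie", "authorization", "proxy-authorization"}


def stateless_prefetch_headers(headers):
    """Returns a copy of `headers` without credential-bearing fields."""
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() not in CREDENTIAL_HEADERS
    }
    # Assumption: the prefetch is flagged so the publisher can recognise it.
    cleaned["Sec-Purpose"] = "prefetch"
    return cleaned


original = {
    "Accept": "text/html",
    "Cookie": "session=abc123",
    "Authorization": "Bearer example-token",
}
prefetch = stateless_prefetch_headers(original)
```

The input mapping is left unmodified, so the same headers can still be used for the later, credentialed navigation once the user actually follows the link.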
Some systems exist that enable forms of profit sharing between aggregators and publishers. These can follow multiple models such as subscriptions (Subscribe with Google), subsidising publishers to make some of their content available (Facebook News Tab), or sharing a tiny fraction of the revenue from a content aggregator's bundle (Apple News Plus).
The ecosystem would globally benefit from enabling aggregators and publishers to share entitlement tokens that would prove that a given user is entitled to specific access. This would enable business models in which publishers and aggregators work together on generating revenue according to their respective participation.
Implementation-wise, this could take the form of a non-fungible token or of some sort of nonce provided as part of the aggregator's linking to the publisher (possibly mediated by the user agent).
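To make the idea of an entitlement token more concrete, here is a minimal sketch of one possible shape for it. It assumes, purely for illustration, that the aggregator and the publisher share a secret key and that the token is an HMAC-signed set of claims carrying a nonce and an expiry; the function names, claim names, and token layout are all hypothetical, and nothing in this document mandates this design.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def issue_token(shared_key, user_ref, resource, ttl_seconds=300, now=None):
    """Aggregator side: signs a short-lived entitlement for one resource."""
    issued = int(time.time() if now is None else now)
    claims = {
        "sub": user_ref,  # opaque user reference, hypothetical claim name
        "res": resource,  # the publisher URL the entitlement covers
        "exp": issued + ttl_seconds,
        "nonce": secrets.token_hex(8),  # makes every token unique
    }
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8")
    )
    signature = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def verify_token(shared_key, token, resource, now=None):
    """Publisher side: returns the claims if the token is valid, else None."""
    try:
        payload_text, signature = token.rsplit(".", 1)
        payload = payload_text.encode("ascii")
        provided = signature.encode("ascii")
    except ValueError:
        return None
    expected = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode("ascii"), provided):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = time.time() if now is None else now
    if claims["res"] != resource or claims["exp"] < current:
        return None
    # Note: single-use enforcement would need the publisher to remember nonces;
    # this sketch does not track them.
    return claims
```

In this sketch the token would travel with the link from the aggregator to the publisher, for instance as a query parameter, and the publisher would call `verify_token` before granting access. A deployment without a shared secret would need a different construction, such as public-key signatures.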
The terms under which aggregators use and reuse publisher content are often decided unilaterally (by the aggregator). This could be improved through the provision of machine-readable licenses that could for instance allow citation for the purpose of linking but not to provide direct answers, or could grant reuse rights for some pictures as part of a portal but not rights to mine the text to train digital assistants with.
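A machine-readable license of this kind could be as simple as a list of permitted uses that an aggregator checks before reusing content. The sketch below is an assumption-laden illustration: the `ContentLicense` type and the vocabulary of uses (`cite-for-linking`, `direct-answer`, `image-in-portal`, `text-mining`) are invented for this example and do not correspond to any existing standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentLicense:
    """Hypothetical publisher license: any use not listed is denied."""

    publisher: str
    permitted: frozenset = field(default_factory=frozenset)

    def allows(self, use):
        """Returns True only if the publisher explicitly granted `use`."""
        return use in self.permitted


# Mirrors the examples in the text: citation for linking and some images in a
# portal are granted; direct answers and text mining are not.
LICENSE = ContentLicense(
    publisher="publisher.example",
    permitted=frozenset({"cite-for-linking", "image-in-portal"}),
)

can_cite = LICENSE.allows("cite-for-linking")
can_answer = LICENSE.allows("direct-answer")
can_mine = LICENSE.allows("text-mining")
```

Denying by default reflects the goal of moving away from terms set unilaterally by the aggregator: a use the publisher has not addressed is treated as not granted.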
Some aggregators run applications that embed Web rendering components that they use instead of a user's chosen browser. This practice, known as "in-app browsers", is an antipattern that causes countless problems with users not being logged in to publisher sites and not having their browsing preferences respected.
This practice also unfairly attempts to keep users on the aggregator's property instead of letting users access the Web properly. Since the user's chosen agent is being substituted, the opportunity for privacy violations is rampant.
While banning this practice entirely may be beyond the reach of the standards process, it would be desirable for browser engines to support an in-app busting HTTP header that would enable publishers to break out of aggregator apps trying to jail in their experience and to keep tracking user behaviour while users believe they are on another property.