Video platforms have surpassed traditional text-based sources as the primary targets for large-scale scraping, highlighting how the pursuit of diverse training data is reshaping the digital economy. The race to feed multimodal AI models is redefining the balance of power on the internet, forcing executives to rethink their data strategy.
Artificial intelligence systems are only as strong as the data that shapes them. The latest Most Scraped Websites report from Decodo shows how far organisations will go to satisfy that appetite. In a single year, TikTok has surged from outside the top ten to become the most scraped site on the planet, with scraping traffic aimed at the platform growing by more than 300 per cent. Behind the headline sits a profound shift in how companies prepare their models for the next generation of machine learning and reasoning.
Executives have long known that large language models depend on immense volumes of text. What is different now is the growing demand for multimodal data: audio, video, images and the contextual signals that make these sources richer and far more challenging to process. Decodo’s research shows that video-first platforms now account for nearly 40 per cent of scraping activity, overtaking search engines and e-commerce. TikTok and YouTube lead the list, but the wider pattern is clear. AI developers are moving beyond words.
Vaidotas Juknys, head of commerce at Decodo, captures the new dynamic. “We are seeing a clear move toward websites that have lots of different types of content instead of just basic info,” he says. “The biggest reason for this shift is that everyone needs tons of varied, good-quality data to train AI chatbots, language models, and other smart tools.” His observation reflects a technical reality: models trained solely on text will never fully master video understanding, speech recognition or the interplay of language and movement that drives modern communication.
Video becomes the new training ground
Platforms once seen primarily as entertainment channels are now critical training grounds for advanced AI systems. TikTok’s short-form videos, layered with audio, captions and engagement metrics, provide a multimodal treasure trove. YouTube follows close behind, offering long-form content that captures conversation, gesture and cultural nuance. For AI researchers, these sources enable the development of systems that can reason across text, sound, and vision with a depth that static images cannot match.
The move towards video scraping is not simply about building consumer-facing chatbots. Enterprise applications from autonomous vehicles to healthcare diagnostics depend on models that can interpret complex visual and auditory signals. Training such systems requires more than a snapshot: these models need sequences, movement and the unpredictable noise of real-world environments. That demand explains why traditional text-heavy sources like Google and Amazon, while still heavily scraped, are losing relative share.
Gabrielė Verbickaitė, senior product marketing manager at Decodo, highlights the underlying shift. “Data might have been the new oil in 2006, but in 2025 it is the fuel that powers artificial intelligence,” she explains. “And AI systems have an appetite for fresh, diverse, and high-quality training data at unprecedented scale.” Her words underline a challenge that reaches far beyond the research lab. Boards and executives must treat external data acquisition as a strategic priority, with the same attention once given to intellectual property or supply chain resilience.
A contest for quality and control
The scramble for training data is creating tension between those who own rich content and those who need it. Many platforms are tightening their terms of service, deploying anti-scraping technologies and litigating against unauthorised extraction. Yet the economic logic driving AI development ensures that demand for public data will continue to rise. Companies building competitive models cannot rely solely on licensed datasets or synthetic augmentation. They require the unpredictable diversity of the open web.
For organisations deploying AI, the lesson is twofold. Governance over the data that is published has never been more critical. Every article, video or product listing is potential training material. At the same time, competitive advantage lies not only in protecting internal assets but also in identifying and integrating external data streams. Decodo’s report shows that businesses with diverse external data sources are already positioning themselves for long-term success in an AI-driven market.
Collecting and curating such data responsibly is not trivial. It demands expertise in data engineering, legal compliance and ethical oversight. Companies must balance their need for rich training materials with respect for privacy, intellectual property, and regional regulations. The European Union’s evolving AI Act and data protection frameworks will test the agility of any enterprise that seeks to leverage scraped data at scale. Organisations that have grown accustomed to traditional compliance checklists will find this new environment far more complex.
Some firms are already building hybrid strategies. They blend selective scraping of public information with partnerships that provide structured feeds and formal licences. Others invest in creating proprietary datasets by encouraging user-generated content under clear terms. The objective is not simply to amass data but to capture diversity, voices, images and interactions that reflect the real world rather than a narrow slice of it. This is the material that helps an AI system avoid bias and understand context.
Strategic responses for the AI enterprise
Senior executives cannot leave these questions to technical teams alone. The scale of AI adoption means data strategy now sits alongside capital investment and market expansion in the boardroom. Decisions about whether to license content, build internal datasets or participate in industry consortia for shared data access have direct implications for competitive positioning and regulatory exposure.
The Most Scraped Websites report illustrates how quickly priorities can shift. TikTok’s 321 per cent surge and the rapid rise of newcomers like Coupang and ScienceDirect reveal how the landscape changes when AI development needs evolve. ScienceDirect’s inclusion points to the value of academic and scientific content. At the same time, the presence of e-commerce giants such as Amazon and eBay confirms that transactional data remains essential for recommendation engines and supply chain optimisation.
Executives should also consider how data scarcity might affect innovation. If access to diverse public data narrows due to legal restrictions or aggressive platform defences, the cost of training cutting-edge models will rise. Smaller firms may struggle to compete, reinforcing the dominance of those with the resources to negotiate data-sharing agreements or to build expansive proprietary datasets. The balance between open data and commercial protection will shape the next phase of AI growth.
The practical response requires investment in robust data pipelines, governance frameworks and multidisciplinary teams. Engineers who design scalable ingestion systems must work closely with legal specialists and ethicists who understand the evolving regulatory landscape. Board-level oversight is essential to ensure that competitive ambition does not outpace responsible practice. The most successful organisations will be those that treat data strategy as an integrated discipline rather than a technical afterthought.
Data as the foundation of AI leadership
The story told by Decodo’s research is not about scraping alone. It is a window into the future of artificial intelligence. As models grow more capable and more ambitious, the hunger for diverse, high-quality information will intensify. Enterprises that secure access to this material and manage it with care will set the pace for the next decade of innovation.
The rise of video-first scraping is therefore more than a trend in digital behaviour. It signals a new phase in the relationship between content and computation. Data has always shaped technology, but the scale and complexity now required are unprecedented. For leaders seeking to build or deploy AI systems, the challenge is to navigate this environment with both creativity and discipline.
Those who succeed will combine technical excellence with strategic foresight. They will understand that every dataset carries not just opportunity but responsibility. They will recognise that the richest insights often lie in the least structured sources, and that the ability to harness them, legally, ethically and efficiently, will define competitive advantage.
TikTok’s rapid ascent to the top of the scraping league table is not an isolated event. It is a signal that the next breakthroughs in artificial intelligence will depend as much on how we collect and govern data as on the algorithms we write. The organisations that grasp this reality now will be the ones shaping the capabilities of AI in the years to come.