New research suggests that nearly half of companies are struggling to verify the origins of external data, raising concerns over how it is used in AI and automation.
A study of 1,000 decision-makers by data collection company Decodo found that 43 per cent of businesses had encountered external data with unclear or unverifiable origins. Other common issues included data conflicting with internal records (38 per cent), vendors unable to explain how the information was gathered (37 per cent), and outdated or incomplete datasets (35 per cent). For organisations that depend on data to inform strategic and operational decisions, these red flags present a growing challenge.
In AI-driven environments, where algorithms depend on accurate and current data to deliver reliable insights, quality concerns can undermine entire projects. As automation becomes embedded in critical decision-making, reliance on faulty or opaque datasets increases the risk of skewed results, bias, and financial losses.
The danger of opaque data sources
One of the clearest warnings from the research is the risk posed by vendors who cannot explain the provenance of their datasets. Without transparency, companies may be relying on outdated, incomplete, or duplicated information that could lead to poor decisions. Direct data collection offers more certainty, allowing businesses to determine when and how the data was gathered, and ensuring it meets specific requirements.
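In practice, that control starts with recording provenance at the point of collection. The sketch below shows one minimal way to do this in Python; the ProvenanceRecord fields and the tag_record helper are illustrative assumptions for the example, not a description of any vendor's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Illustrative provenance metadata attached to each collected record."""
    source_url: str          # where the data was fetched from
    collected_at: datetime   # when it was gathered (UTC)
    method: str              # e.g. "direct_api", "scrape", "vendor_feed"
    collector_version: str   # version of the pipeline that produced it

def tag_record(data: dict, source_url: str, method: str) -> dict:
    """Wrap raw data with provenance so downstream consumers can audit it."""
    provenance = ProvenanceRecord(
        source_url=source_url,
        collected_at=datetime.now(timezone.utc),
        method=method,
        collector_version="1.0.0",  # hypothetical version string
    )
    return {"data": data, "provenance": provenance}
```

Tagging records this way means that when a dataset later conflicts with internal records, teams can trace exactly where and when the disputed values came from.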
Vaidotas Juknys, Head of Commerce at Decodo, says that self-managed data gathering enables greater control over freshness and quality while avoiding the recurring costs of pre-packaged datasets. “If a vendor cannot walk you through their process, you are essentially buying a black box,” he says.
The survey also highlights technical barriers such as anti-bot defences, broken scraping tools, IP blocks, and datasets flagged by security teams. While these may look like operational rather than strategic concerns, they directly affect an organisation's ability to access timely, relevant data for its AI systems.
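Barriers of this kind are usually absorbed in the collection layer. As a rough illustration, the Python sketch below retries a fetch with exponential backoff when requests fail or appear blocked; the retry counts, delays, and use of the requests library are assumptions chosen for the example, not a prescribed setup.

```python
import time

import requests

def fetch_with_retries(url: str, max_attempts: int = 3) -> str | None:
    """Fetch a URL, backing off and retrying on transient failures or blocks.

    Illustrative sketch only: real pipelines typically also rotate proxies
    and user agents, and respect robots.txt and rate limits.
    """
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code in (403, 429):  # likely blocked or rate-limited
                raise requests.HTTPError(f"blocked with status {resp.status_code}")
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            if attempt == max_attempts:
                print(f"giving up on {url}: {exc}")
                return None
            time.sleep(delay)  # exponential backoff between attempts
            delay *= 2
```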
The cost of stale datasets in AI workflows
In fast-moving markets, outdated or incomplete data can have a disproportionate effect on AI models. Vytautas Savickas, CEO at Decodo, warns that stale datasets can distort insights and reduce the value of automation. Inaccurate data can perpetuate bias in machine learning models or lead to flawed forecasts in sectors such as finance, logistics, and retail.
Keeping data fresh requires regular updates, diverse sourcing, and automated validation tools that can identify anomalies before they enter production systems. AI-powered validation, in particular, can flag inconsistencies more quickly than manual processes and help organisations maintain confidence in their inputs.
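As a rough sketch of what such a gate might look like, the Python example below drops stale rows and flags statistical outliers before a batch reaches production; the seven-day age limit, the z-score threshold, and the pandas usage are assumptions made for illustration, not a prescribed implementation.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Illustrative thresholds; real values depend on the domain and update cadence.
MAX_AGE = timedelta(days=7)
Z_SCORE_LIMIT = 3.0

def validate_batch(df: pd.DataFrame, value_col: str, ts_col: str) -> pd.DataFrame:
    """Reject stale rows and flag outliers before they reach production.

    Assumes ts_col holds timezone-aware timestamps.
    """
    now = datetime.now(timezone.utc)

    # Freshness gate: drop rows older than the maximum allowed age.
    fresh = df[df[ts_col] >= now - MAX_AGE].copy()

    # Simple anomaly flag: mark values more than Z_SCORE_LIMIT standard
    # deviations from the batch mean for human or model review.
    mean, std = fresh[value_col].mean(), fresh[value_col].std()
    fresh["anomaly"] = (fresh[value_col] - mean).abs() > Z_SCORE_LIMIT * std

    return fresh
```

In production, a validator like this would typically run as a pipeline step, routing flagged rows to review rather than silently dropping them.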
The findings suggest that businesses looking to strengthen their AI and automation capabilities need to make data provenance and integrity central to their strategy. That includes thorough vetting of data providers, transparency around sourcing, and investment in systems that monitor accuracy and relevance on an ongoing basis.
As Gabrielė Verbickaitė, Product Marketing Manager at Decodo, notes, “AI is only as good as the data that fuels it. If your inputs are stale, your outputs will be too.”
With 43 per cent of companies already encountering unverifiable sources, the push for greater transparency, real-time collection, and proactive quality checks is likely to intensify. For organisations aiming to maintain trust in their AI-driven decisions, reliable data is no longer just a competitive advantage – it is a prerequisite.