ИНТЕГРИРАНЕ НА ДАННИ ОТ ИНТЕРНЕТ В ОФИЦИАЛНА СТАТИСТИКА: ПРОЕКТЪТ „ДОВЕРЕНА УМНА СТАТИСТИКА (TSS) - МРЕЖА ЗА ИЗУЧАВАНЕ НА УЕБПРОСТРАНСТВОТО (WIN)“
INTEGRATING WEB DATA INTO OFFICIAL STATISTICS: THE ‘TRUSTED SMART STATISTICS (TSS) - WEB INTELLIGENCE NETWORK (WIN)’ PROJECT
Author(s): Mariana Angelova, Kostadin Georgiev
Subject(s): Economy, National Economy, Business Economy / Management, ICT Information and Communications Technologies
Published by: Национален статистически институт
Keywords: Web data integration; Official statistics; Web Intelligence Hub; Machine learning classification; Hybrid data sources
Summary/Abstract: Introduction: The exponential growth of web data, including online job adverts, firm websites and price portals, creates an unprecedented opportunity to modernize official European statistics. Nevertheless, integrating such dynamic, unstructured sources into the rigid quality frameworks of national statistical institutes (NSIs) remains methodologically and technically challenging. Aim and objectives: The EU-funded ESSnet Trusted Smart Statistics - Web Intelligence Network (ESSnet WIN, 2021-2025) united 17 European national statistical institutes to design, build and pilot the Web Intelligence Hub (WIH): a cloud platform and methodological stack able to harvest, process and quality-assure web data for routine statistical production. Methodology: The project is organized into four work packages: (WP1) governance, training and a community of 70+ experts (WISER); (WP2) production-grade software for Online Job Adverts (OJA) and Online-Based Enterprise Characteristics (OBEC); (WP3) exploratory use cases in real estate, construction, tourism, prices and business registers; (WP4) harmonized quality and methodological guidelines anchored in the modular BREAL big-data reference architecture and a layered quality-assurance workflow (landscaping → web scraping → sampling and data wrangling → ML classification → validation against traditional sources). Main results: WIH is operational on AWS Athena, and the OJA and OBEC pipelines run quarterly for 15+ countries. OJA: 25k-60k live adverts per month are classified to ISCO-1 (56-62% accuracy) and to NUTS-2 regions; unstable sources and coverage gaps remain. OBEC: 90%+ URL linkage of firms to websites; 70-80% accuracy in detecting e-commerce, multilingual or social-media presence. New pilots: monthly rent/price indices for 5 countries; early construction indicators for Germany and Sweden; hotel-price indices from Booking.com; traffic-camera counts as high-frequency proxies for economic activity. In addition, 9 webinars, 27 blogs, a hackathon and a final conference engaged more than 1,200 participants. Conclusions: Web data can enrich, not replace, traditional sources when combined with administrative registers and continuous quality control. Hybrid workflows, adaptive ML models and open-source tooling are essential for credibility and scalability. Limitations: The accuracy of ML classification, legal and ethical constraints (GDPR), volatile sources and uneven geographic coverage constrain immediate operational use. Practical implications: NSIs can embed WIH modules for labour market monitoring, business demography, CPI augmentation and crisis indicators. Shared crawlers, harmonized quality metrics and LLM-assisted discovery will reduce duplication and costs. Originality/value: ESSnet WIN delivers the first pan-European, production-ready architecture and tested guidelines for trustworthy web-data integration into official statistics, providing a reusable blueprint for NSIs worldwide.
Journal: Статистика
- Issue Year: 2025
- Issue No: 1
- Page Range: 66-88
- Page Count: 23
- Language: Bulgarian
