Use of Scanner Data and Webscraping in Price Statistics
Objective and implementation
The calculation of price indices has a tradition of more than one hundred years at Statistics Austria; time series of the consumer price index date back to 1958. Data collection is organised by Statistics Austria as a primary statistical survey and takes place in shops, as well as by e-mail, by telephone or on the internet. In order to constantly improve data collection, new data sources are regularly evaluated to supplement the indices. Two data sources have already been successfully integrated into the statistical production process:
- Collecting prices on the internet: The large range of products on the internet makes manual selection and data collection difficult, which is why automated price and data collection by means of so-called webcrawlers or webscrapers is used. These programmes capture predefined variables (e.g. product descriptions, quantities, sizes, and prices) and store them in a structured form for further processing. After initial experience with point-and-click tools, the focus is now on in-house programming of webscrapers in R and Python. The scripts are placed in a general framework that enables monitoring through automatically generated e-mails and reports, so that even the smallest changes on a website are noticed and handled immediately.
- Scanner data: Scanner cash registers capture each purchase electronically and record not only the price, but also the product name, the sales discount and the quantity purchased. The place and date of the sale are also recorded and can be evaluated. In order to use scanner data, all products must first be classified according to the ECOICOP classes. A framework of different machine learning algorithms (from Naive Bayes to deep learning) has been designed for this purpose and delivers accuracies above 90 percent. Products that are classified inconsistently by different algorithms indicate allocation errors and are adjusted manually.
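The manual-review step for inconsistently classified products can be illustrated with a minimal Python sketch. It assumes that each classifier has already produced one ECOICOP code per product; the function name `consolidate`, the product names and the codes shown are hypothetical illustrations, not part of the production framework described above.

```python
from collections import Counter


def consolidate(predictions):
    """Return (majority_label, needs_review) for one product.

    predictions: list of ECOICOP codes, one per classifier (assumed input).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    # Any disagreement between classifiers sends the product to manual review.
    needs_review = votes < len(predictions)
    return label, needs_review


# Hypothetical outputs of three classifiers for two products.
model_outputs = {
    "Wholemeal bread 500g": ["01.1.1", "01.1.1", "01.1.1"],
    "Oat drink 1l": ["01.1.4", "01.2.2", "01.1.4"],
}

# Products whose classifiers disagree are collected for manual adjustment.
review_queue = [
    name for name, preds in model_outputs.items() if consolidate(preds)[1]
]
```

In this sketch the majority vote is only a provisional label; a product stays in `review_queue` until a person has confirmed or corrected its class.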
The implementation of scanner data as a data source for the consumer price index is planned for January 2022. From then on, local price collection will be partially replaced by scanner data. However, because price collection had to be cancelled in spring and winter 2020 due to Covid-19, scanner data had to be used ahead of schedule in order to compensate for missing price reports.
Since 2019, a new CPI regulation has governed the delivery of scanner data from the large supermarket chains to Statistics Austria. Among other things, the regulation legally defines the sampling units, the periodicity, the time range and the variables of the data delivery.
The following general conditions should be taken into account for webscraping:
- Webscraping programmes should be clearly identifiable for website providers through the use of a fixed designation (UserAgent description).
- Technical hurdles (captchas, IP-blocks) on the provider's side must not be bypassed.
- Webscraping must not be used to duplicate the database of a website provider in order to make it available elsewhere.
- Webscraping processes must not negatively impact the performance of the website provider's infrastructure.
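Some of the conditions above can be sketched in Python using only the standard library. This is a minimal illustration, not the framework used in production: the class `PoliteFetcher`, the User-Agent string, the contact address and the five-second delay are all assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request

# Hypothetical fixed designation; a real one would name the institution
# and give a working contact address.
USER_AGENT = "ExamplePriceStatsBot/1.0 (contact: webscraping@example.org)"
MIN_DELAY_SECONDS = 5.0  # assumed pause between requests to limit server load


class PoliteFetcher:
    def __init__(self, user_agent=USER_AGENT, min_delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def build_request(self, url):
        # Every request carries the same identifiable User-Agent header.
        return urllib.request.Request(url, headers={"User-Agent": self.user_agent})

    def wait_turn(self):
        # Sleep until at least min_delay has passed since the last request.
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url):
        self.wait_turn()
        try:
            with urllib.request.urlopen(self.build_request(url), timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (403, 429):
                # Blocked or rate-limited by the provider: stop here and
                # do not try to work around the restriction.
                return None
            raise
```

Injecting `clock` and `sleep` is a design choice that keeps the throttling logic testable without real waiting; a call such as `PoliteFetcher().fetch(url)` then applies the delay and the fixed header to every page requested.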
To meet these requirements, our webscraping activities comply with the guidelines developed by Eurostat.
Innovation within the project
The project is innovative as it uses new data sources and improves the quality of price indices: data acquisition is more efficient, more up-to-date (among other things, no late price reports) and can cover higher data volumes. In the long term, scanner data can be used to cover a complete range of goods instead of a sample (limited to food and drugstore trade for the time being).
Interpretation of the results
The new data sources and their high data volume allow for different price index calculation methods, which can lead to different index properties. Their advantages and disadvantages have to be analysed and evaluated before deciding on their ultimate use for price statistics.
The new data sources and methods may lead to higher price index volatility. Whether, and to what extent, the results deviate from those already published as official statistics is still unclear at the moment.
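How different calculation methods can produce different results from the same data can be shown with a small Python sketch. It compares an unweighted Jevons index with a Törnqvist index, which uses the quantities that scanner data provide. The choice of these two formulas and the toy prices and quantities are assumptions made for illustration; the text does not state which methods are being evaluated.

```python
import math


def jevons(p0, p1):
    """Unweighted geometric mean of price relatives."""
    relatives = [p1[k] / p0[k] for k in p0]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))


def tornqvist(p0, q0, p1, q1):
    """Geometric mean of price relatives, weighted by the average
    expenditure share of each product in the two periods."""
    e0 = {k: p0[k] * q0[k] for k in p0}
    e1 = {k: p1[k] * q1[k] for k in p0}
    t0, t1 = sum(e0.values()), sum(e1.values())
    log_index = sum(
        0.5 * (e0[k] / t0 + e1[k] / t1) * math.log(p1[k] / p0[k]) for k in p0
    )
    return math.exp(log_index)


# Toy data (hypothetical): product A is discounted and sells far more units.
p0 = {"A": 2.00, "B": 1.00}
q0 = {"A": 100, "B": 100}
p1 = {"A": 1.00, "B": 1.00}
q1 = {"A": 400, "B": 100}

index_jevons = jevons(p0, p1)
index_tornqvist = tornqvist(p0, q0, p1, q1)
```

With these toy figures the Jevons index is roughly 0.71 and the Törnqvist index roughly 0.60, because the weighted formula gives more influence to the heavily purchased discounted product.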
Further information, project results
For webscraping, the aim is to include the first price indices based on automatically collected prices in the official index in the course of 2021. Mobile phone tariffs have been selected as the initial area of application.