Bolivia Public Tenders
Overview
This project automates the extraction of Bolivia's public tender data. It was started because the official website restricts large-scale access to and extraction of its data. The site implements multiple security layers to prevent automated data collection, which makes it difficult for investors, researchers, and analysts to reach crucial information. The project overcomes these obstacles with a combination of web scraping, API reverse engineering, and machine learning, and retrieves structured data for further analysis.
Technical Approach
Identifying Data Source & Security Mechanisms
The public tenders webpage does not provide direct access to structured data and is protected by multiple security layers, including hidden API endpoints, session-based secret tokens, and CAPTCHA verification.
API Reverse Engineering
By monitoring the website's behavior, we discovered that it generates unique secret ID tokens that are required to interact with the page. These IDs are embedded by JavaScript before the page fully loads, which makes them hard for conventional scrapers to detect. We therefore built a custom scraper that intercepts the scripts, extracts the tokens, and reconstructs the requests needed to access the data programmatically.
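The token extraction step can be illustrated with a minimal sketch. It assumes the token is assigned in an inline script as a quoted string; the variable name `secretId`, the URL, and the parameter names are hypothetical placeholders, not the real site's identifiers.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the page contains something like
#   var secretId = "a1B2-c3";
# The real variable name and token format are not documented here.
TOKEN_PATTERN = re.compile(r'secretId\s*=\s*["\']([A-Za-z0-9_-]+)["\']')


def extract_token(html: str) -> Optional[str]:
    """Return the first token found in the page source, or None."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


def build_request(token: str, tender_id: int) -> dict:
    """Describe the follow-up request that carries the extracted token.

    The URL and parameter names are placeholders for illustration only.
    """
    return {
        "url": "https://example.invalid/api/tenders",
        "params": {"id": tender_id, "token": token},
    }


SAMPLE_PAGE = """
<html><head>
<script>var secretId = "a1B2-c3"; init(secretId);</script>
</head><body></body></html>
"""

token = extract_token(SAMPLE_PAGE)
request = build_request(token, tender_id=42)
```

In practice the returned dictionary would be passed to an HTTP client within the same session that loaded the page, since the tokens are session-based.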
CAPTCHA Solving with a Convolutional Neural Network (CNN)
To further prevent automation, the website enforces a CAPTCHA challenge on every request for additional information. To address this, we trained a Convolutional Neural Network (CNN) to recognize and decode the CAPTCHA images. To reduce the need for extensive manual labeling, we applied data augmentation to the CAPTCHA dataset, systematically transforming the labeled images to artificially expand the training data. The trained model decoded the CAPTCHAs with high accuracy, which allowed the scraper to pass this security layer.
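The augmentation idea can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the specific transforms (small horizontal shifts and salt-and-pepper noise), the function names, and the grayscale list-of-rows image format are all chosen here for the example.

```python
import random


def shift_horizontal(image, dx, fill=0):
    """Shift every row by dx pixels (positive = right), padding with `fill`.

    Assumes abs(dx) does not exceed the image width.
    """
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * dx + row[:width - dx])
        else:
            shifted.append(row[-dx:] + [fill] * (-dx))
    return shifted


def add_noise(image, rng, prob=0.02):
    """Replace each pixel with pure black or white with probability `prob`."""
    return [
        [rng.choice((0, 255)) if rng.random() < prob else px for px in row]
        for row in image
    ]


def augment(image, label, copies, seed=0, max_shift=2):
    """Create `copies` transformed variants that all keep the original label."""
    rng = random.Random(seed)
    samples = []
    for _ in range(copies):
        dx = rng.randint(-max_shift, max_shift)
        samples.append((add_noise(shift_horizontal(image, dx), rng), label))
    return samples


# A tiny 4x6 grayscale "image" standing in for one labeled CAPTCHA.
image = [[0, 50, 100, 150, 200, 250] for _ in range(4)]
samples = augment(image, label="A7", copies=5)
```

Each manually labeled CAPTCHA yields several training samples with the same label, which is how augmentation cuts down on manual labeling.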
Data Retrieval and Pipeline
For data retrieval, we chose multiprocessing over an asynchronous approach. Because the machine learning OCR model is lightweight, we ran one worker process per CPU core to parallelize the entire data extraction process.
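A minimal sketch of the one-worker-per-core layout, using Python's standard `multiprocessing.Pool`. The function `process_tender` is a hypothetical stand-in: in the real pipeline each call would fetch a page, solve its CAPTCHA, and parse the record, whereas here it only returns a placeholder dictionary.

```python
import os
from multiprocessing import Pool


def process_tender(tender_id):
    """Placeholder for the per-tender work (fetch, solve CAPTCHA, parse).

    Hypothetical: returns a dummy record so the sketch runs without network.
    """
    return {"id": tender_id, "status": "ok"}


def worker_count():
    """One worker per CPU core, falling back to 1 if the count is unknown."""
    return os.cpu_count() or 1


def run_parallel(tender_ids):
    """Distribute tender IDs across the worker pool, preserving input order."""
    with Pool(processes=worker_count()) as pool:
        return pool.map(process_tender, tender_ids)


if __name__ == "__main__":
    records = run_parallel(range(1, 9))
    print(len(records), "records extracted")
```

Separate processes fit this workload because each request includes CPU-bound model inference, which would not run in parallel across threads or coroutines in a single Python process.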
Dataset
Bolivia Public Tenders
Historical, ready-to-use data from 2006 to 2024, ideal for researchers and businesses. The dataset can be used for in-depth analysis, market research, forecasting, or academic studies to gain actionable insights into Bolivia.
Public Tenders Data
The data has been collected and is shared ready for analysis, providing a comprehensive snapshot of Bolivia's public tenders.
Purpose and Impact
This project was developed in response to a national financial crisis, during which transparency in public spending is crucial. Unlike in most countries, public tenders in Bolivia are hidden behind security barriers, which limits access for investors, journalists, and citizens. The next step in this project is an in-depth analysis of the extracted data that offers insight into this issue.
More about this
To access this dataset or request more information, please contact us.
Author
Bartor S.