Build and manage Selenium web scrapers with Auto-Scrape

tldr: Auto-scrape lets you focus on writing web scraping scripts while it takes care of logging, data persistence, data presentation and data export, all through a modern browser-based UI. It can be run locally or deployed remotely.

Why scrape the web?

Building a Selenium web scraper is almost a rite of passage for programmers starting out. Watching a computer fill out forms, click links and collect data before your eyes is a highly satisfying and suitably concrete exercise for beginners. Beyond that, browser automation forms a foundation for frontend testing, can be used for automated research, and, of course, can be used to replace those expensive and unreliable humans for a wide range of business-related tasks.

I have built web scrapers to collect sales leads, conduct market research on the Australian tertiary education tuition market, gather market statistics for second-hand vehicle sales, and aggregate stock market data to provide share portfolio balancing recommendations.

Experience using web scrapers for these types of projects has taught me a few things:

  • It’s difficult to debug a scraper without proper logging.
  • The structure of collected data is (almost) always 2D or 3D and doesn’t require reinventing the wheel for each application (see the sketch after this list).
  • A client might want the ability to re-run a scraper at certain times, potentially with different parameters. Installing Python on their machine so they can run some_script.py is not the answer.
  • Most clients will want data delivered in a spreadsheet.
  • If you or a client are running multiple scrapers daily, you really want them to be running on a remote server.
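
To make the second point concrete: a scraped dataset is almost always just rows with named columns, which maps directly onto a database table or a spreadsheet. A minimal sketch of that shape (illustrative data only, not auto-scrape’s internal representation):

```python
# Each scraped record is one row; the dict keys become the columns (2D data).
rows = [
    {"title": "Show HN: Auto-Scrape", "points": 120, "comments": 45},
    {"title": "Ask HN: Favourite scraping tips?", "points": 87, "comments": 12},
]

# "3D" data is just several such tables, e.g. one per category or per results page.
tables = {"front_page": rows, "newest": []}
```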

In order to speed up delivery time and provide a better product for future projects, I decided to build an (open source 👍) platform that solves these common problems, allowing me to focus on simply writing the scraping script.

Introducing auto-scrape

Auto-scrape is a platform for building, managing and remotely deploying web scrapers. It provides the “essential infrastructure” for web scraping while allowing developers to focus on writing Selenium web scraping scripts in a simple and familiar way.

It is built using the Flask framework and uses SQLAlchemy to interface with the SQL database of your choice.
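
To give a sense of the “simple and familiar way”, here is a bare-bones, self-contained Selenium script of the kind auto-scrape is designed to wrap. This is plain Selenium with none of the auto-scrape integration points, and the CSS selector assumes the current Hacker News markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes a compatible chromedriver is available on your PATH.
driver = webdriver.Chrome()
try:
    driver.get("https://news.ycombinator.com/")
    rows = []
    # Each front-page post title sits inside a span.titleline element.
    for link in driver.find_elements(By.CSS_SELECTOR, "span.titleline > a"):
        rows.append({"title": link.text, "url": link.get_attribute("href")})
    print(f"Scraped {len(rows)} posts")
finally:
    driver.quit()
```

The idea is that auto-scrape wraps a script like this with the logging, persistence and UI described below.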

Features include:

  • live progress logging
  • database for saving scraped data - no database experience required! (see the sketch after this list)
  • CSV export
  • multiple simultaneous scrapers
  • basic resource management
  • basic user authentication for remote deployments.
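
The “no database experience required” part comes from SQLAlchemy doing the heavy lifting. As a rough illustration of the idea (a generic SQLAlchemy sketch, not auto-scrape’s actual schema or API), a table of scraped rows can be defined and written to any supported database like this:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class ScrapedRow(Base):
    __tablename__ = "scraped_rows"
    id = Column(Integer, primary_key=True)
    session_id = Column(Integer)  # which scraping session produced the row
    title = Column(String)
    url = Column(String)

# SQLite here, but the connection string is the only thing tying you to a backend.
engine = create_engine("sqlite:///scrape.db")
Base.metadata.create_all(engine)

with Session(engine) as db:
    db.add(ScrapedRow(session_id=1, title="Example post", url="https://example.com"))
    db.commit()
```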

User interface overview

The dashboard provides an overview of “Active Sessions” (daemon browser instances in the process of scraping) and “Past Sessions” (sessions that have either completed or been prematurely terminated).

Live logging

The below screencast shows live logs that allow the progress of an active session to be tracked in real time. Scraped data can also be previewed while a scraper is in progress:

[Screencast: Live logging]
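
The log stream is simply the scraper reporting what it is doing as it goes. In a plain script, the equivalent with Python’s standard logging module looks roughly like this (illustrative only, not auto-scrape’s logging API):

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")

for page in range(1, 4):
    log.info("Fetching results page %d", page)
    # ... drive the browser and extract rows here ...
    log.info("Finished page %d", page)
```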

Data export

The below screencast shows scraped data being exported as a CSV file:

[Screencast: Data export to CSV]
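
The export itself is conceptually simple: the 2D rows in the database are written out as one header line plus one line per row. In plain Python it amounts to something like this (a sketch of the idea, not auto-scrape’s export code):

```python
import csv

rows = [
    {"title": "Example post one", "points": 120},
    {"title": "Example post two", "points": 87},
]

with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```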

Check out the data exported from a scrape of the YC Hacker News front page posts here. If a session is prematurely terminated, the status is recorded as “Aborted”, with the logs and saved data remaining available.

Multiple simultaneous scrapers

The below screencast shows multiple simultaneous scrapers running. In this example we have max_active_sessions = 3, but this can easily be adjusted depending on available system resources:

[Screencast: Multiple simultaneous scrapers]
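
The max_active_sessions cap is essentially a bound on how many browser sessions run at once; extra jobs wait for a free slot. The general pattern, using only Python’s standard library (a generic sketch, not auto-scrape’s scheduler), looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ACTIVE_SESSIONS = 3  # mirrors the max_active_sessions setting above

def run_scraper(job_id):
    # In a real session this would launch a Selenium driver and run the script.
    print(f"Running scraper session {job_id}")

# Never more than MAX_ACTIVE_SESSIONS jobs run at once; the rest queue up.
with ThreadPoolExecutor(max_workers=MAX_ACTIVE_SESSIONS) as pool:
    pool.map(run_scraper, range(10))
```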

Questions?

Have you used auto-scrape for one of your projects, or do you have any questions? Feel free to leave a comment below, or to reach out to me directly.