Web crawling is the process of programmatically fetching and extracting data from websites. It's useful for a variety of applications, such as scraping product data from e-commerce sites, archiving web pages, analyzing content across many sites, and more.
PHP is a great language for web crawling for several reasons:
- As the server-side language behind close to 80% of websites whose server-side language is known, PHP has robust built-in functionality for making HTTP requests and processing HTML content
- PHP has a gentle learning curve, making it accessible to beginners
- There are many PHP libraries available to simplify common web crawling tasks
- PHP scripts can be automated to run on a schedule for continual data extraction
In this guide, we'll walk through building an automated web crawler in PHP from scratch. By the end you'll have your own functional web scraper that can extract data from sites and run on autopilot.
Components of a Web Crawler
At a high level, a web crawler needs to:
- Fetch the HTML of web pages by making HTTP requests to URLs
- Parse the retrieved HTML to extract relevant data and find new pages to crawl
- Manage crawling multiple pages by recursively following links
- Store the extracted data in a structured format
We'll implement each of these components in PHP to create our automated web crawler.
Setting Up the Project
To get started, we'll set up a new PHP project for the web crawler. Create a new directory and add the following files:
- composer.json: Configuration file for dependencies and autoloading
- crawler.php: Main PHP script that will perform the crawling
- utilities.php: Helper functions used by the crawler
We'll use Composer, a dependency manager for PHP, to install libraries for making HTTP requests and parsing HTML. Open composer.json and add:
{
    "require": {
        "ext-curl": "*",
        "voku/simple_html_dom": "^4.8"
    },
    "autoload": {
        "files": ["utilities.php"]
    }
}
This specifies that our project requires the PHP cURL extension and the voku/simple_html_dom HTML parsing library. We've also set up autoloading for the utilities.php file.
Install the dependencies by running:
composer install
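Alternatively, you could let Composer add the HTML parser entry for you instead of editing composer.json by hand (Composer will pick a suitable version constraint automatically); the ext-curl requirement and the autoload section still need to be added manually:
composer require voku/simple_html_dom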
Fetching Pages with cURL
The first step in crawling is fetching the HTML content of pages. We'll use PHP's cURL extension, which lets us make HTTP requests directly from PHP.
In utilities.php, add a function to fetch a URL and return the HTML:
function fetchHTML($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
This function initializes a new cURL session, sets the target URL, executes the request, and returns the response HTML.
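In practice, you'll probably also want the request to follow redirects, give up on slow servers, and signal failures instead of returning false. Here's one possible hardened variant you could swap in; the user-agent string and timeout values are arbitrary choices, not requirements:
function fetchHTML($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of echoing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for a connection
        CURLOPT_TIMEOUT        => 30,    // overall request timeout in seconds
        CURLOPT_USERAGENT      => 'MyPhpCrawler/1.0', // identify the crawler
    ]);
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Treat transport errors and non-200 responses as an empty page
    if ($html === false || $status !== 200) {
        return '';
    }
    return $html;
}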
Extracting Data from HTML
Once we have the HTML of a page, we need to parse it and extract the desired data. There are a few different PHP libraries for HTML parsing and DOM traversal, but we'll use voku/simple_html_dom for its simplicity.
Add a new function to utilities.php to parse HTML and extract data:
use voku\helper\HtmlDomParser;

function extractData($html) {
    $dom = HtmlDomParser::str_get_html($html);
    $data = [];
    foreach($dom->find('a') as $link) {
        $data[] = [
            'text' => $link->plaintext,
            'href' => $link->href
        ];
    }
    $dom->clear();
    unset($dom);
    return $data;
}
This function accepts an HTML string, loads it into a DOM object using voku/simple_html_dom, finds all the link elements, and extracts their text and URLs into an array.
We can customize this function to extract whatever data we want from the HTML. voku/simple_html_dom provides jQuery-like selectors and traversal methods for finding the right elements.
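For example, here's a rough sketch of a more targeted extractor. The .product-title selector is purely hypothetical; you'd replace it with a selector that matches the markup of the site you're actually crawling. Since utilities.php already imports HtmlDomParser at the top, the sketch can live right next to extractData:
function extractTitles($html) {
    $dom = HtmlDomParser::str_get_html($html);
    $titles = [];
    // '.product-title' is a placeholder selector; inspect the target page's
    // HTML and swap in a selector that matches its real structure.
    foreach($dom->find('.product-title') as $element) {
        $titles[] = trim($element->plaintext);
    }
    return $titles;
}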
Crawling Multiple Pages
A key part of a web crawler is the ability to navigate multiple pages by following links. We'll set up our crawler to recursively follow links up to a maximum depth.
In crawler.php, add the following:
require 'vendor/autoload.php';

function crawlPage($url, $depth = 0, $maxDepth = 2) {
    if($depth > $maxDepth) {
        return;
    }
    $html = fetchHTML($url);
    $data = extractData($html);
    foreach($data as $item) {
        if(!empty($item['href'])) {
            crawlPage($item['href'], $depth + 1, $maxDepth);
        }
    }
    // Process extracted data here
    print_r($data);
}

crawlPage("https://example.com");
Let's break this down:
- crawlPage is a recursive function that accepts a URL to crawl, the current depth, and the maximum depth
- We first check if the maximum depth has been exceeded, and if so, return to stop further recursion
- The page HTML is fetched using the fetchHTML function and then passed to extractData to parse out the desired data
- We loop through the extracted data, and for each URL found, recursively call crawlPage to crawl that linked page
- Finally, we can process the extracted data, such as formatting and saving it. For now we're just printing it out.
The script kicks off the crawling process by calling crawlPage with the initial URL to start crawling from.
We've set a maximum recursion depth to avoid crawling too many pages and potentially getting stuck in an infinite loop.
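Another safeguard worth adding, not shown above, is a record of already-visited URLs so the crawler never fetches the same page twice, even when pages link back to each other. One minimal way to do it is with a static array inside crawlPage:
function crawlPage($url, $depth = 0, $maxDepth = 2) {
    // Remember every URL crawled so far across all recursive calls
    static $visited = [];
    if($depth > $maxDepth || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $html = fetchHTML($url);
    $data = extractData($html);
    foreach($data as $item) {
        if(!empty($item['href'])) {
            crawlPage($item['href'], $depth + 1, $maxDepth);
        }
    }
    print_r($data);
}
Because $visited is declared static, it persists across the recursive calls for the lifetime of the script.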
Saving Extracted Data
As data is extracted, we'll want to save it somewhere for later analysis and use. For this example, we'll store the extracted link data in a database.
First, create a new MySQL database and table to hold the links:
CREATE DATABASE crawler;
USE crawler;
CREATE TABLE links (
    id INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(2048),
    text VARCHAR(255)
);
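You can run these statements from any MySQL client. For example, if you save them to a file (the schema.sql name below is just an example), the mysql command-line client can load it like this:
mysql -u root -p < schema.sql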
In utilities.php, add functions to connect to the database and insert links:
function getDBConnection() {
    $host = "localhost";
    $username = "root";
    $password = "";
    $dbname = "crawler";
    return new PDO("mysql:host=$host;dbname=$dbname", $username, $password);
}
function insertLink($url, $text) {
    $dbh = getDBConnection();
    $stmt = $dbh->prepare("INSERT INTO links (url, text) VALUES (?, ?)");
    $stmt->execute([$url, $text]);
    $dbh = null;
}
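Opening a fresh connection for every link works, but it's wasteful once the crawler is saving many rows. One small refinement is to cache the PDO handle in a static variable so the connection is created only once per run (enabling exceptions on errors is optional, but it makes failures easier to spot):
function getDBConnection() {
    static $dbh = null;
    if ($dbh === null) {
        $host = "localhost";
        $username = "root";
        $password = "";
        $dbname = "crawler";
        $dbh = new PDO("mysql:host=$host;dbname=$dbname", $username, $password);
        // Surface database errors as exceptions instead of silent failures
        $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }
    return $dbh;
}
With this version, the $dbh = null line in insertLink only clears the local variable; the cached connection stays open for the next insert.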
And update the loop in crawlPage to save the extracted links:
foreach($data as $item) {
    // Only save and follow links that actually have a URL
    if(!empty($item['href'])) {
        insertLink($item['href'], $item['text']);
        crawlPage($item['href'], $depth + 1, $maxDepth);
    }
}
Now when you run php crawler.php, the crawler will fetch pages, extract links, and save them to the database. You can view the saved data by connecting to the database and querying the links table.
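As a quick sanity check, you could also dump the most recent rows from a small PHP script. This is just a sketch; it assumes it's run from the project root so Composer's autoloader (and with it utilities.php) loads the getDBConnection() helper:
require 'vendor/autoload.php';

$dbh = getDBConnection();
$stmt = $dbh->query("SELECT url, text FROM links ORDER BY id DESC LIMIT 10");
foreach ($stmt as $row) {
    // Print the ten most recently saved links
    echo $row['url'] . ' => ' . $row['text'] . PHP_EOL;
}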
Automating the Crawler
To turn our PHP crawler into a fully automated solution, we need to set it up to run continuously or on a set schedule without manual intervention.
One option is to trigger the crawler script via a cron job. Cron is a Unix utility for scheduling scripts to execute periodically.
To set up a cron job for the crawler:
1. In your terminal, run crontab -e to open your crontab file
2. Add an entry to execute crawler.php at your desired frequency. For example, to run it every hour:
   0 * * * * /usr/bin/php /path/to/crawler.php >> /path/to/crawler.log
   This will run crawler.php at the top of every hour and append the output to crawler.log.
3. Save and exit the crontab file
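If a crawl could take longer than an hour, you may also want to prevent two runs from overlapping. On Linux, one common approach is to wrap the command in flock (part of util-linux); the lock-file path here is an arbitrary choice:
0 * * * * flock -n /tmp/crawler.lock /usr/bin/php /path/to/crawler.php >> /path/to/crawler.log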
Make sure the PHP script runs without errors and exits gracefully if any issues come up during execution. Adding status logging is also a good idea so you can monitor the health of the crawler.
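A tiny helper like the following is often enough to start with; since the cron entry above already appends stdout to crawler.log, echoing timestamped lines gets them into the log. The function name and messages are just examples:
function logStatus($message) {
    // Timestamped status line; under the cron setup above, stdout
    // is appended to crawler.log, so echoing is all we need
    echo '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
}
You could then call logStatus("Crawling $url") at the top of crawlPage and another logStatus call whenever a fetch or insert fails.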
With the cron job set up, the crawler will now automatically run at the scheduled time to fetch new data. You can inspect the log file to view crawler activity and check the database to see newly extracted records.
Taking it Further
We've implemented a basic but fully functional PHP web crawler that extracts link data and runs automatically!
There are many ways we can expand its capabilities:
- Add respect for robots.txt files, which specify what pages are allowed to be crawled
- Implement rate limiting to avoid overloading servers (see the sketch after this list)
- Set up a queueing system to manage a high volume of URLs to crawl
- Use more advanced HTML parsing techniques like regular expressions or XPath
- Integrate natural language processing to analyze extracted text content
- Leverage a headless browser to crawl JavaScript-rendered pages
- Distribute crawling across multiple machines for better performance and scalability
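As an example of the rate-limiting idea referenced above, the simplest version is just a short pause before each request inside crawlPage; the one-second delay is an arbitrary starting point you'd tune per site:
function crawlPage($url, $depth = 0, $maxDepth = 2) {
    if($depth > $maxDepth) {
        return;
    }
    // Simple rate limiting: pause briefly before every request so the
    // crawler doesn't hammer the target server
    sleep(1);
    $html = fetchHTML($url);
    // ... the rest of crawlPage stays the same as before
}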
The beauty of writing your own web crawler is you can customize it to your exact needs and incrementally add functionality over time.
Codeless Alternatives
Writing and maintaining your own web scraper is a great way to extract web data while retaining full control and flexibility. But it does require substantial development time and technical know-how.
No-code web scraping tools, like Octoparse, allow non-programmers to easily scrape websites without writing any code. With a visual point-and-click interface, you can create automated scrapers to extract data from any site in minutes.
Octoparse also provides advanced features out-of-the-box like IP rotation, XPath parsing, scheduled crawling, and direct saving to databases or files. For projects where you need to quickly stand up a production-quality web crawler, a visual scraping tool is an excellent option.
Conclusion
Web crawling with PHP is a powerful technique for automating data extraction from websites. With some basic building blocks — HTTP requests, HTML parsing, and recursive linking — you can create a fully functional web crawler in a short amount of time.
In this guide, we walked through the key components of a PHP web scraper:
- Using cURL to programmatically fetch web pages
- Parsing HTML responses with DOM traversal libraries to extract structured data
- Recursively crawling multiple pages by following links
- Storing extracted data in a database
- Scheduling the PHP crawler script to run automatically on a set schedule
We also touched on ways to enhance the crawler, like respecting robots.txt, handling JavaScript-rendered content, and scaling with queues and distributed systems.
If you're new to web scraping, starting with a simple PHP crawler is a great way to familiarize yourself with the core concepts. As you gain experience, you can layer on more advanced functionality to build sophisticated web crawlers.
Of course, if you need to quickly extract web data without any coding, tools like Octoparse provide an intuitive visual interface to create automated web scrapers.
No matter your approach, web crawling is an incredibly useful skill to automate data collection from online sources. Hopefully this guide has provided you with a solid foundation to start building your own web crawlers in PHP!