Web crawling is the process of programmatically fetching and extracting data from websites. It's useful for a variety of applications, such as scraping product data from e-commerce sites, archiving web pages, analyzing content across many sites, and more.
PHP is a great language for web crawling for several reasons:
- As the server-side language behind close to 80% of websites whose server-side language is known, PHP has robust built-in functionality for making HTTP requests and processing HTML content
- PHP has a gentle learning curve, making it accessible to beginners
- There are many PHP libraries available to simplify common web crawling tasks
- PHP scripts can be automated to run on a schedule for continual data extraction
In this guide, we'll walk through building an automated web crawler in PHP from scratch. By the end you'll have your own functional web scraper that can extract data from sites and run on autopilot.
Components of a Web Crawler
At a high level, a web crawler needs to:
- Fetch the HTML of web pages by making HTTP requests to URLs
- Parse the retrieved HTML to extract relevant data and find new pages to crawl
- Manage crawling multiple pages by recursively following links
- Store the extracted data in a structured format
We'll implement each of these components in PHP to create our automated web crawler.
Setting Up the Project
To get started, we'll set up a new PHP project for the web crawler. Create a new directory and add the following files:
- composer.json: Configuration file for dependencies and autoloading
- crawler.php: Main PHP script that will perform the crawling
- utilities.php: Helper functions used by the crawler
We'll use Composer, a dependency manager for PHP, to install libraries for making HTTP requests and parsing HTML. Open composer.json and add:
{
    "require": {
        "ext-curl": "*",
        "voku/simple_html_dom": "^4.8"
    },
    "autoload": {
        "files": ["utilities.php"]
    }
}
This specifies that our project requires the PHP cURL extension and the voku/simple_html_dom HTML parsing library. We've also set up autoloading for the utilities.php file.
Install the dependencies by running:
composer install
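Alternatively, you could let Composer add the HTML parser entry for you instead of editing composer.json by hand (Composer will pick a suitable version constraint automatically); the ext-curl requirement and the autoload section still need to be added manually:
composer require voku/simple_html_dom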
Fetching Pages with cURL
The first step in crawling is fetching the HTML content of pages. We'll use PHP's cURL extension, which lets us make HTTP requests directly from PHP.
In utilities.php, add a function to fetch a URL and return the HTML:
function fetchHTML($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
This function initializes a new cURL session, sets the target URL, executes the request, and returns the response HTML.
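In practice, you'll probably also want the request to follow redirects, give up on slow servers, and signal failures instead of returning false. Here's one possible hardened variant you could swap in; the user-agent string and timeout values are arbitrary choices, not requirements:
function fetchHTML($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of echoing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for a connection
        CURLOPT_TIMEOUT        => 30,    // overall request timeout in seconds
        CURLOPT_USERAGENT      => 'MyPhpCrawler/1.0', // identify the crawler
    ]);
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // Treat transport errors and non-200 responses as an empty page
    if ($html === false || $status !== 200) {
        return '';
    }
    return $html;
}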
Extracting Data from HTML
Once we have the HTML of a page, we need to parse it and extract the desired data. There are a few different PHP libraries for HTML parsing and DOM traversal, but we'll use voku/simple_html_dom for its simplicity.
Add a new function to utilities.php to parse HTML and extract data:
use voku\helper\HtmlDomParser;

function extractData($html) {
    $dom = HtmlDomParser::str_get_html($html);
    $data = [];
    foreach($dom->find('a') as $link) {
        $data[] = [
            'text' => $link->plaintext,
            'href' => $link->href
        ];
    }
    $dom->clear();
    unset($dom);
    return $data;
}
This function accepts an HTML string, loads it into a DOM object using voku/simple_html_dom, finds all the link elements, and extracts their text and URLs into an array.
We can customize this function to extract whatever data we want from the HTML. voku/simple_html_dom provides jQuery-like selectors and traversal methods for finding the right elements.
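For example, here's a rough sketch of a more targeted extractor. The .product-title selector is purely hypothetical; you'd replace it with a selector that matches the markup of the site you're actually crawling. Since utilities.php already imports HtmlDomParser at the top, the sketch can live right next to extractData:
function extractTitles($html) {
    $dom = HtmlDomParser::str_get_html($html);
    $titles = [];
    // '.product-title' is a placeholder selector; inspect the target page's
    // HTML and swap in a selector that matches its real structure.
    foreach($dom->find('.product-title') as $element) {
        $titles[] = trim($element->plaintext);
    }
    return $titles;
}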
Crawling Multiple Pages
A key part of a web crawler is the ability to navigate multiple pages by following links. We'll set up our crawler to recursively follow links up to a maximum depth.
In crawler.php, add the following:
require 'vendor/autoload.php';

function crawlPage($url, $depth = 0, $maxDepth = 2) {
    if($depth > $maxDepth) {
        return;
    }
    $html = fetchHTML($url);
    $data = extractData($html);
    foreach($data as $item) {
        if(!empty($item['href'])) {
            crawlPage($item['href'], $depth + 1, $maxDepth);
        }
    }
    // Process extracted data here
    print_r($data);
}

crawlPage("https://example.com");
Let's break this down:
- crawlPage is a recursive function that accepts a URL to crawl, the current depth, and the maximum depth
- We first check if the maximum depth has been exceeded, and if so, return to stop further recursion
- The page HTML is fetched using the fetchHTML function and then passed to extractData to parse out the desired data
- We loop through the extracted data, and for each URL found, recursively call crawlPage to crawl that linked page
- Finally, we can process the extracted data, such as formatting and saving it. For now we're just printing it out.
The script kicks off the crawling process by calling crawlPage with the initial URL to start crawling from.
We've set a maximum recursion depth to avoid crawling too many pages and potentially getting stuck in an infinite loop.
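Another safeguard worth adding, not shown above, is a record of already-visited URLs so the crawler never fetches the same page twice, even when pages link back to each other. One minimal way to do it is with a static array inside crawlPage:
function crawlPage($url, $depth = 0, $maxDepth = 2) {
    // Remember every URL crawled so far across all recursive calls
    static $visited = [];
    if($depth > $maxDepth || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $html = fetchHTML($url);
    $data = extractData($html);
    foreach($data as $item) {
        if(!empty($item['href'])) {
            crawlPage($item['href'], $depth + 1, $maxDepth);
        }
    }
    print_r($data);
}
Because $visited is declared static, it persists across the recursive calls for the lifetime of the script.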
Saving Extracted Data
As data is extracted, we'll want to save it somewhere for later analysis and use. For this example, we'll store the extracted link data in a database.
First, create a new MySQL database and table to hold the links:
CREATE DATABASE crawler;
USE crawler;
CREATE TABLE links (
    id INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(2048),
    text VARCHAR(255)
);
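You can run these statements from any MySQL client. For example, if you save them to a file (the schema.sql name below is just an example), the mysql command-line client can load it like this:
mysql -u root -p < schema.sql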
In utilities.php, add functions to connect to the database and insert links:
function getDBConnection() {
    $host = "localhost";
    $username = "root";
    $password = "";
    $dbname = "crawler";
    return new PDO("mysql:host=$host;dbname=$dbname", $username, $password);
}
function insertLink($url, $text) {
    $dbh = getDBConnection();
    $stmt = $dbh->prepare("INSERT INTO links (url, text) VALUES (?, ?)");
    $stmt->execute([$url, $text]);
    $dbh = null;
}
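Opening a fresh connection for every link works, but it's wasteful once the crawler is saving many rows. One small refinement is to cache the PDO handle in a static variable so the connection is created only once per run (enabling exceptions on errors is optional, but it makes failures easier to spot):
function getDBConnection() {
    static $dbh = null;
    if ($dbh === null) {
        $host = "localhost";
        $username = "root";
        $password = "";
        $dbname = "crawler";
        $dbh = new PDO("mysql:host=$host;dbname=$dbname", $username, $password);
        // Surface database errors as exceptions instead of silent failures
        $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }
    return $dbh;
}
With this version, the $dbh = null line in insertLink only clears the local variable; the cached connection stays open for the next insert.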
And update the loop in crawlPage to save the extracted links:
foreach($data as $item) {
    // Only save and follow links that actually have a URL
    if(!empty($item['href'])) {
        insertLink($item['href'], $item['text']);
        crawlPage($item['href'], $depth + 1, $maxDepth);
    }
}
Now when you run php crawler.php, the crawler will fetch pages, extract links, and save them to the database. You can view the saved data by connecting to the database and querying the links table.
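As a quick sanity check, you could also dump the most recent rows from a small PHP script. This is just a sketch; it assumes it's run from the project root so Composer's autoloader (and with it utilities.php) loads the getDBConnection() helper:
require 'vendor/autoload.php';

$dbh = getDBConnection();
$stmt = $dbh->query("SELECT url, text FROM links ORDER BY id DESC LIMIT 10");
foreach ($stmt as $row) {
    // Print the ten most recently saved links
    echo $row['url'] . ' => ' . $row['text'] . PHP_EOL;
}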
Automating the Crawler
To turn our PHP crawler into a fully automated solution, we need to set it up to run continuously or on a set schedule without manual intervention.
One option is to trigger the crawler script via a cron job. Cron is a Unix utility for scheduling scripts to execute periodically.
To set up a cron job for the crawler:
1. In your terminal, run crontab -e to open your crontab file
2. Add an entry to execute crawler.php at your desired frequency. For example, to run it every hour:
   0 * * * * /usr/bin/php /path/to/crawler.php >> /path/to/crawler.log
   This will run crawler.php at the top of every hour and append the output to crawler.log.
3. Save and exit the crontab file
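If a crawl could take longer than an hour, you may also want to prevent two runs from overlapping. On Linux, one common approach is to wrap the command in flock (part of util-linux); the lock-file path here is an arbitrary choice:
0 * * * * flock -n /tmp/crawler.lock /usr/bin/php /path/to/crawler.php >> /path/to/crawler.log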
Make sure the PHP script runs without errors and exits gracefully if any issues come up during execution. Adding status logging is also a good idea so you can monitor the health of the crawler.
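A tiny helper like the following is often enough to start with; since the cron entry above already appends stdout to crawler.log, echoing timestamped lines gets them into the log. The function name and messages are just examples:
function logStatus($message) {
    // Timestamped status line; under the cron setup above, stdout
    // is appended to crawler.log, so echoing is all we need
    echo '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
}
You could then call logStatus("Crawling $url") at the top of crawlPage and another logStatus call whenever a fetch or insert fails.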
With the cron job set up, the crawler will now automatically run at the scheduled time to fetch new data. You can inspect the log file to view crawler activity and check the database to see newly extracted records.
Taking it Further
We've implemented a basic but fully functional PHP web crawler that extracts link data and runs automatically!
There are many ways we can expand its capabilities:
- Add respect for robots.txt files, which specify what pages are allowed to be crawled
- Implement rate limiting to avoid overloading servers (see the sketch after this list)
- Set up a queueing system to manage a high volume of URLs to crawl
- Use more advanced HTML parsing techniques like regular expressions or XPath
- Integrate natural language processing to analyze extracted text content
- Leverage a headless browser to crawl JavaScript-rendered pages
- Distribute crawling across multiple machines for better performance and scalability
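As an example of the rate-limiting idea referenced above, the simplest version is just a short pause before each request inside crawlPage; the one-second delay is an arbitrary starting point you'd tune per site:
function crawlPage($url, $depth = 0, $maxDepth = 2) {
    if($depth > $maxDepth) {
        return;
    }
    // Simple rate limiting: pause briefly before every request so the
    // crawler doesn't hammer the target server
    sleep(1);
    $html = fetchHTML($url);
    // ... the rest of crawlPage stays the same as before
}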
The beauty of writing your own web crawler is you can customize it to your exact needs and incrementally add functionality over time.
Codeless Alternatives
Writing and maintaining your own web scraper is a great way to extract web data while retaining full control and flexibility. But it does require substantial development time and technical know-how.
No-code web scraping tools, like Octoparse, allow non-programmers to easily scrape websites without writing any code. With a visual point-and-click interface, you can create automated scrapers to extract data from any site in minutes.
Octoparse also provides advanced features out-of-the-box like IP rotation, XPath parsing, scheduled crawling, and direct saving to databases or files. For projects where you need to quickly stand up a production-quality web crawler, a visual scraping tool is an excellent option.
Conclusion
Web crawling with PHP is a powerful technique for automating data extraction from websites. With some basic building blocks — HTTP requests, HTML parsing, and recursive linking — you can create a fully functional web crawler in a short amount of time.
In this guide, we walked through the key components of a PHP web scraper:
- Using cURL to programmatically fetch web pages
- Parsing HTML responses with DOM traversal libraries to extract structured data
- Recursively crawling multiple pages by following links
- Storing extracted data in a database
- Scheduling the PHP crawler script to run automatically on a set schedule
We also touched on ways to enhance the crawler, like respecting robots.txt, handling JavaScript-rendered content, and scaling with queues and distributed systems.
If you're new to web scraping, starting with a simple PHP crawler is a great way to familiarize yourself with the core concepts. As you gain experience, you can layer on more advanced functionality to build sophisticated web crawlers.
Of course, if you need to quickly extract web data without any coding, tools like Octoparse provide an intuitive visual interface to create automated web scrapers.
No matter your approach, web crawling is an incredibly useful skill to automate data collection from online sources. Hopefully this guide has provided you with a solid foundation to start building your own web crawlers in PHP!