How to View All Links on a Web Page using PHP

Do you ever need to extract all the links from a web page, whether for data mining, backlink analysis, or simple curiosity? In this blog post, we will walk through how to do exactly that using PHP.

First, let's look at why we would use PHP for this task. Compared to tools like BeautifulSoup or Scrapy in Python, or Cheerio in JavaScript, PHP is rarely the first choice for web scraping. It has some real advantages, though:

  1. It is widely used and supported by most web hosts.
  2. It's relatively easy to learn if you have some programming background.
  3. It ships with built-in support for parsing HTML documents via the DOM extension (see the short sketch after this list).
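
As a quick taste of that built-in support, here is a minimal sketch that parses a small HTML string with DOMDocument, no third-party library needed:

<?php
// Parse a small HTML string with the built-in DOM extension.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="https://example.com">Example</a></p>');
echo $doc->getElementsByTagName('a')->item(0)->getAttribute('href');
// prints: https://example.com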

Now, let's dive into the code. We will create a simple PHP script that accepts a URL as an argument and outputs all the unique links found on the page.

<?php

// Fetch the URL with cURL and return its HTML parsed into a DOMDocument.
function getDomDocument($url) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true,   // return the response instead of printing it
        CURLOPT_HEADER         => false,  // leave response headers out of the body
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_BUFFERSIZE     => 1024,
        CURLOPT_ENCODING       => "",     // accept any encoding cURL supports
        CURLOPT_USERAGENT      => "test-agent",
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
    );

    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);

    if ($content === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception("cURL request failed: " . $error);
    }
    curl_close($ch);

    $html = new DOMDocument();
    // Suppress the warnings DOMDocument emits on malformed real-world HTML;
    // it still builds a usable tree.
    @$html->loadHTML($content);

    return $html;
}

// Collect the unique href values of every <a> element in the document.
function extractLinks($dom) {
    $links = array();

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if ($href !== '' && !in_array($href, $links)) {
            $links[] = $href;
        }
    }

    return $links;
}

// $argv is not available inside function scope, so main() takes it as a parameter.
function main($argv) {
    if (count($argv) != 2) {
        die("Usage: php script.php <URL>\n");
    }

    $url = $argv[1];

    try {
        $dom = getDomDocument($url);
        $links = extractLinks($dom);

        print_r($links);
    } catch (Exception $e) {
        die("Error: " . $e->getMessage() . "\n");
    }
}

main($argv);

To use this script, save the code in a file named script.php, and then run it from the command line with the URL as an argument:

$ php script.php https://example.com
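
On a page like https://example.com, which contains a single outbound link, the output should look something like this:

Array
(
    [0] => https://www.iana.org/domains/example
)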

This script fetches the HTML content of the given URL using cURL, parses it with DOMDocument, extracts all unique links, and prints them to the console. It is meant to be run from the command line in a local or development environment: exposing it on a public web server would let anyone make your server fetch arbitrary URLs, a classic server-side request forgery (SSRF) risk.
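
If you do adapt it for a web-facing context, validate the input before fetching it. Here is a minimal sketch built around PHP's filter_var; the helper name isFetchableUrl is our own invention, and the check only rejects malformed URLs and non-HTTP schemes (it does not block internal hostnames), so treat it as a starting point rather than a complete defense:

// Reject anything that is not a well-formed http(s) URL before fetching it.
function isFetchableUrl($url) {
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, array('http', 'https'), true);
}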

Happy exploring! Let us know if you have any questions or suggestions in the comments below.

Published August 2015