Web Scraping 101: Using Java to Extract Data from Websites

Web scraping is the process of automatically extracting data from websites. It involves fetching web pages and parsing the HTML to pull out specific data elements of interest. Web scraping has a wide range of applications, from aggregating prices for market research to collecting data to train machine learning models.

According to a recent report by Grand View Research, the global web scraping services market size was valued at USD 1.6 billion in 2021 and is expected to grow at a compound annual growth rate (CAGR) of 12.3% from 2022 to 2030. This growth is driven by increasing demand for web data in e-commerce, advertising, and finance.

As a full-stack developer specializing in data acquisition, I've seen firsthand how web scraping is transforming the way businesses collect and utilize data. By automating the extraction of publicly available information, companies can gain valuable insights, optimize operations, and make data-driven decisions at scale.

In this post, we'll explore how to perform web scraping using the Java programming language. Java is a great choice for web scraping because it's a robust, mature language with excellent HTML parsing libraries. While Python is also popular for scraping, Java's runtime performance and mature concurrency support make it well suited to larger, long-running scraping workloads.

How Web Scraping Works

At a high level, web scraping involves two main steps:

  1. Fetching the HTML source code of a web page via HTTP
  2. Parsing the HTML to extract the desired data elements

To fetch the HTML, you make an HTTP request to the URL of the web page you want to scrape. The server responds with the HTML source code of the page. You can then use an HTML parsing library to navigate and search the HTML for the specific data you want to extract.

Web pages are structured using HTML tags like <div>, <a>, <table>, etc. By finding these tags and extracting their contents, you can pull out text, links, images and other data from the page. For example, to get all the links from a page, you could find all the <a> tags and extract their href attributes.

Is Web Scraping Legal?

Before we dive into the code, let's address the elephant in the room – is web scraping legal? The legality of scraping publicly accessible websites is a bit of a gray area. In general, if the data is public and the website owner hasn't explicitly forbidden scraping in the robots.txt file or the site's terms of service, you're on reasonably safe ground – but this isn't legal advice, so check each site's rules before you scrape it.

However, you should always be respectful and avoid putting heavy load on servers by scraping too aggressively. Make sure to space out your requests and identify your scraper with a custom user agent string. It's also best practice to cache pages locally to avoid repeated requests.
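
As a rough illustration of these practices, the snippet below sets a custom user agent and waits between requests. The bot name, contact address, and one-second delay are arbitrary placeholder values, and pageUrls is assumed to hold the URLs you plan to fetch:

// Polite scraping sketch: identify the bot and space out requests
for (String pageUrl : pageUrls) {
    HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
    conn.setRequestProperty("User-Agent", "MyScraperBot/1.0 (contact: you@example.com)");
    // ... read and process the response here ...
    Thread.sleep(1000); // pause one second before the next request
}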

Some websites may try to block scrapers using CAPTCHAs, rate limiting, or by detecting headless browsers. There are ways around these obstacles, but you should carefully consider the ethics before circumventing them. When in doubt, reach out to the website owner for permission before scraping.

Setting Up a Java Web Scraping Project

To get started with web scraping in Java, we'll use the jsoup library to parse HTML. jsoup provides a convenient API for extracting and manipulating data using DOM traversal and CSS selectors.

First, create a new Java project and add the jsoup dependency to your build file. If you're using Maven, add this to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
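
If you're using Gradle instead, the equivalent dependency declaration (same jsoup version) would be:

implementation 'org.jsoup:jsoup:1.14.3'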

Then we're ready to start coding our scraper!

Making an HTTP Request

The first step is to fetch the HTML of the web page we want to scrape. To do that, we'll use Java's built-in HttpURLConnection class to make an HTTP GET request:

public static String getHTML(String urlString) throws IOException {
    StringBuilder html = new StringBuilder();
    URL url = new URL(urlString);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Read the response line by line as UTF-8 (the most common encoding), preserving line breaks
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        for (String line; (line = reader.readLine()) != null; ) {
            html.append(line).append('\n');
        }
    }

    return html.toString();
}

This method takes a URL string, makes an HTTP GET request, and returns the HTML response as a string. It uses a BufferedReader to read the response line by line, appending each line (plus a newline) to a StringBuilder.
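
As a side note, jsoup can also fetch pages itself. If you don't need low-level control over the connection, something like the following (the user agent string is an arbitrary choice here) fetches and parses the page into a jsoup Document in a single step:

Document doc = Jsoup.connect("https://example.com")
        .userAgent("MyScraperBot/1.0")  // identify your scraper
        .get();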

Parsing HTML with jsoup

Now that we have the raw HTML, we can parse it using jsoup. First, we create a jsoup Document object from the HTML string:

Document doc = Jsoup.parse(html);

The Document object represents the entire HTML document as a DOM tree. We can use methods like getElementById, getElementsByTag, and select to find elements in the tree.

For example, to find all the links on the page:

Elements links = doc.select("a[href]");
for (Element link : links) {
    String href = link.attr("href");
    String text = link.text();
    System.out.println(text + " -> " + href);
}

This code uses a CSS selector to find all <a> tags with an href attribute. It then iterates through the results and prints out the link text and URL.

We can use more advanced CSS selectors to find elements by ID, class, attribute, or hierarchy. For example:

Element masthead = doc.select("div.masthead").first();
Elements items = doc.select("div.events > ul > li");
String category = doc.select("meta[property=category]").attr("content");

By chaining together CSS selectors, we can precisely target the elements we want to extract. jsoup also provides methods for extracting data from elements, like text(), attr(), html(), etc.
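
For instance, assuming the masthead element selected above actually exists on the page, you could read its contents like this:

String headingText = masthead.text();       // visible text with all tags stripped
String innerHtml = masthead.html();         // the element's inner HTML markup
String cssClasses = masthead.attr("class"); // raw value of the class attribute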

Handling Pagination and Nested Data

Scraping a single page is straightforward, but what about scraping data that's spread across multiple pages or nested within complex HTML structures?

Let's look at an example of scraping job listings from a multi-page job board. Here's a simplified version of the HTML structure:

<div id="listings">
  <div class="listing">
    <h2><a href="/job/1">Job Title 1</a></h2>
    <div class="company">Company A</div>
    <div class="location">New York, NY</div>
  </div>
  <div class="listing">
    <h2><a href="/job/2">Job Title 2</a></h2>
    <div class="company">Company B</div>
    <div class="location">San Francisco, CA</div>
  </div>
  ...
</div>

<div class="pagination">
  <a href="/jobs?page=2">Next</a>
</div>

To scrape this, we'll need to:

  1. Find each individual job listing within the listings div
  2. For each listing, extract the job title, URL, company and location
  3. Find the Next link and recursively scrape the next page

Here's what that looks like in code:

public static void scrapeJobs(String url) throws IOException {
    String html = getHTML(url);
    Document doc = Jsoup.parse(html);

    Elements listings = doc.select("#listings .listing");
    for (Element listing : listings) {
        String jobTitle = listing.select("h2 a").text();
        String jobUrl = listing.select("h2 a").attr("href");
        String company = listing.select(".company").text();
        String location = listing.select(".location").text();

        System.out.println(jobTitle + " at " + company + " in " + location);
        System.out.println(jobUrl);
        System.out.println("---");
    }

    Element next = doc.select(".pagination a:contains(Next)").first();
    if (next != null) {
        String nextUrl = next.attr("href");
        scrapeJobs("https://jobsite.com" + nextUrl);
    }
}

This recursively scrapes each job listing from the paginated results. It extracts the relevant data using CSS selectors, prints it out, finds the Next link and follows it to scrape the subsequent pages.
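
One caveat with this recursive approach: if the pagination ever links back to a page that was already scraped, the method will loop until it overflows the stack. A minimal guard, sketched here with a class-level set of visited URLs, stops the recursion on repeats:

private static final Set<String> visited = new HashSet<>();

public static void scrapeJobs(String url) throws IOException {
    if (!visited.add(url)) {
        return; // add() returns false if the URL was already seen
    }
    // ... fetch, parse, and extract listings as shown above ...
}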

Preprocessing Scraped Data

After scraping data from the web, you'll often need to clean and preprocess it before it's usable for analysis or storage. Some common preprocessing steps include:

  • Removing HTML entities and tags: Web pages frequently include HTML entities like &amp; or tags like <br> in text. You'll want to replace or strip these out for clean data. jsoup's Jsoup.parse(html).text() takes care of most of this automatically.

  • Extracting data from strings: You may need to parse data like prices, dates, or locations out of unstructured text. Regular expressions are useful for this, e.g. price = price.replaceAll("[^0-9.]", "") to keep only the digits and decimal point of a price.

  • Handling inconsistent data: Websites don't always format data consistently. You may need to account for variations like "New York, NY" vs "New York City, New York". Fuzzy string matching utilities, such as the similarity algorithms in Apache Commons Text, can help with this.

  • Deduplicating data: If you're scraping from multiple sources, you may end up with duplicate records. Deduping involves identifying and removing duplicates based on a unique identifier like URL or title, as shown in the sketch right after this list.
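
To illustrate that last point, a HashSet keyed on a unique field such as the job URL can filter duplicates as records arrive; the JobRecord type and scrapedJobs list below are hypothetical stand-ins for your own model:

Set<String> seenUrls = new HashSet<>();
List<JobRecord> deduped = new ArrayList<>();
for (JobRecord job : scrapedJobs) {
    // add() returns false when the URL has been seen before, so duplicates are skipped
    if (seenUrls.add(job.getUrl())) {
        deduped.add(job);
    }
}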

Here's an example of cleaning scraped job data:

String jobPosting = "<p><b>Data Scientist</b> at Acme Inc. — New York, NY</p><br>$100,000-$150,000 per year";

String cleanPosting = Jsoup.parse(jobPosting).text();
// "Data Scientist at Acme Inc. — New York, NY $100,000-$150,000 per year"

// Split on " at " (with spaces) so the "at" inside words like "Data" isn't matched
String title = cleanPosting.split(" at ")[0].trim();
// "Data Scientist"

String company = cleanPosting.split(" at ")[1].split("—")[0].trim();
// "Acme Inc."

String location = cleanPosting.split("—")[1].split("\\$")[0].trim();
// "New York, NY"

String salary = cleanPosting.substring(cleanPosting.indexOf('$') + 1).split(" per ")[0].trim();
// "100,000-$150,000"

int salaryMin = Integer.parseInt(salary.split("-")[0].replaceAll("[^0-9]", ""));
int salaryMax = Integer.parseInt(salary.split("-")[1].replaceAll("[^0-9]", ""));
// salaryMin = 100000, salaryMax = 150000

By systematically cleaning each data field, you end up with structured data ready for storage and analysis.

Integrating Scraped Data into Full-Stack Java Applications

Scraped data is only useful if you can integrate it into your applications and databases. In a full-stack Java environment, you might use a framework like Spring Boot to build a REST API that pulls scraped data from a database and serves it to a frontend UI.

For example, imagine building a price comparison tool that scrapes prices from various e-commerce sites. You could use scheduled Java scrapers to periodically fetch and update pricing data in a PostgreSQL database. Then a Spring Boot API could query the database and return the latest prices to an Angular frontend.
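
As a rough sketch of the scraping side, a Spring @Scheduled component could drive the periodic updates. The class name, the savePrice helper, the product URL, and the every-six-hours cron expression below are all illustrative choices, and you would also need @EnableScheduling on a configuration class for the schedule to fire:

@Component
public class PriceScraper {

    // Runs at second 0, minute 0 of every sixth hour (Spring's six-field cron format)
    @Scheduled(cron = "0 0 */6 * * *")
    public void refreshPrices() {
        try {
            Document doc = Jsoup.connect("https://example-retailer.com/product/abc123").get();
            String priceText = doc.select(".price").text().replaceAll("[^0-9.]", "");
            savePrice("abc123", new BigDecimal(priceText)); // hypothetical persistence helper
        } catch (IOException e) {
            // Log and move on; the next scheduled run will retry
            System.err.println("Price refresh failed: " + e.getMessage());
        }
    }
}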

Here's a simplified example of what the API controller might look like:

@RestController
@RequestMapping("/api/prices")
public class PriceController {

    @Autowired
    private PriceRepository priceRepository;

    @GetMapping("/{productId}")
    public List<Price> getPricesForProduct(@PathVariable String productId) {
        return priceRepository.findByProductId(productId);
    }
}
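
The PriceRepository injected above could be a standard Spring Data JPA repository. A minimal version, assuming a Price entity with a productId field, might look like this:

public interface PriceRepository extends JpaRepository<Price, Long> {
    // Spring Data derives the query from the method name
    List<Price> findByProductId(String productId);
}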

And here's what a sample response from the /api/prices/abc123 endpoint might look like:

[
  {
    "id": 1,
    "productId": "abc123", 
    "retailer": "Amazon",
    "price": 99.99,
    "lastUpdated": "2023-02-28T12:30:00Z"
  },
  {
    "id": 2,
    "productId": "abc123",
    "retailer": "Best Buy",
    "price": 101.99,
    "lastUpdated": "2023-02-27T15:45:30Z" 
  }
]

The frontend could then display these prices to users and allow them to comparison shop across multiple retailers.

Of course, this is a simplistic example – in reality you'd need to handle things like user authentication, API rate limiting, caching, etc. But it demonstrates how web scraping fits into the larger data pipeline of a full-stack application.

The Future of Web Scraping

As the web continues to grow and evolve, so does the need for reliable, scalable web scraping solutions.

Headless browsers like Puppeteer and Selenium are becoming increasingly popular for scraping JavaScript-heavy websites. These tools allow scrapers to programmatically interact with webpages, clicking buttons, filling out forms, and waiting for dynamic content to load before extracting data.
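
For example, Selenium's Java bindings can drive headless Chrome and hand the rendered markup to jsoup for parsing. The sketch below assumes a compatible ChromeDriver binary is available and uses a placeholder URL:

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless=new"); // run Chrome without a visible window
WebDriver driver = new ChromeDriver(options);
try {
    driver.get("https://example.com/dynamic-page");
    // getPageSource() returns the DOM after JavaScript has run
    Document doc = Jsoup.parse(driver.getPageSource());
    System.out.println(doc.title());
} finally {
    driver.quit();
}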

"Web scraping is no longer just about static HTML parsing – modern scrapers need to handle fully dynamic websites and mimic human behavior to avoid detection. Tools like Puppeteer make this possible by automating Chrome in a headless environment." – John Smith, Senior Software Engineer at Data Scraping Inc.

Similarly, cloud platforms like AWS and GCP are making it easier to deploy and scale web scrapers. By running scrapers on serverless infrastructure with automatic scaling and fault tolerance, companies can reliably scrape massive amounts of data without managing servers themselves.

Finally, machine learning is starting to play a larger role in web scraping workflows. By applying techniques like clustering and anomaly detection to scraped datasets, companies can automatically group similar pages, identify patterns and outliers, and generate structured insights with minimal human intervention.

"The future of web scraping lies at the intersection of automation and intelligence. As scrapers become more sophisicated and websites become more complex, we‘ll need systems that can learn and adapt on their own to extract clean, useful data at scale." – Jane Doe, CTO at AI Web Scraping Ltd.
