Web scraping is the process of automatically extracting data and content from websites using a program or script. It allows you to gather information from online sources much faster than manual copying and pasting. Web scraping has many useful applications, from collecting pricing data for market research to aggregating news articles or sports scores.
In this in-depth tutorial, we'll walk through how to scrape the web using Visual Basic (VB) and the .NET Framework. By the end, you'll have a solid foundation to start building your own web scrapers to collect data for any purpose.
Setting Up Your Visual Basic Web Scraping Project
To get started, you'll need:
- Visual Studio (2019 or later) with the .NET desktop development workload installed
- The HtmlAgilityPack library for parsing HTML
- The PuppeteerSharp library for scraping dynamic websites and single-page apps
First, open Visual Studio and create a new Windows Forms App (.NET Framework) project in VB. Give it a name like "WebScraper" and choose a location to save it.
Next, you need to install the libraries you'll use via the NuGet Package Manager:
- In the Solution Explorer, right-click your project
- Select "Manage NuGet Packages…"
- Search for and install HtmlAgilityPack and PuppeteerSharp
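If you prefer the command line, the same packages can also be installed from the Package Manager Console (Tools > NuGet Package Manager > Package Manager Console) with two commands:

Install-Package HtmlAgilityPack
Install-Package PuppeteerSharp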
Now you have the basic setup ready to start coding your scraper!
Scraping a Static Website with HtmlAgilityPack
For our first example, we'll scrape a simple static Wikipedia article. Our goal is to extract the main title, the introductory paragraph, and all the hyperlinks from the article.
Add a button to your form and double-click it to open the code editor. Add the following code:
Imports HtmlAgilityPack

Private Sub ScrapeButton_Click(sender As Object, e As EventArgs) Handles ScrapeButton.Click
    Dim url As String = "https://en.wikipedia.org/wiki/Web_scraping"

    ' Download and parse the page
    Dim web As New HtmlWeb()
    Dim doc As HtmlDocument = web.Load(url)

    ' Select the title, intro paragraph, and all hyperlinks with XPath
    Dim titleNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//h1")
    Dim introNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='mw-parser-output']/p[1]")
    Dim linkNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")

    Dim title As String = titleNode.InnerText
    Dim intro As String = introNode.InnerText

    ' Collect the href attribute of every link
    Dim links As New List(Of String)
    For Each link As HtmlNode In linkNodes
        links.Add(link.GetAttributeValue("href", ""))
    Next

    ResultsTextBox.Text = title & vbNewLine & vbNewLine & intro & vbNewLine & vbNewLine & String.Join(vbNewLine, links)
End Sub
This code does the following:
- Imports the HtmlAgilityPack namespace so we can use its types
- Defines a string URL of the Wikipedia article to scrape
- Creates a new HtmlWeb client to download the web page
- Loads the HTML document from the URL
- Uses XPath queries to select the title (//h1), intro paragraph (//div[@class='mw-parser-output']/p[1]), and all hyperlinks (//a[@href])
- Extracts the text and href values from those nodes
- Joins the results and displays them in a TextBox control
The HtmlAgilityPack library makes it very easy to target specific elements on the page using XPath syntax. You can craft queries to select elements by tag name, attribute values, position in the document tree, and more. Refer to the HtmlAgilityPack selector documentation for more details and examples.
When you run this code, you should see the extracted title, intro, and links populated in your form. Congrats, you just scraped your first web page in VB!
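To get a feel for that flexibility, here are a few more queries you could run against the same doc object. These are illustrative patterns rather than selectors taken from the Wikipedia page; the class name in the last query is a made-up example:

' Select elements by tag name, attribute value, and position (illustrative queries)
Dim firstHeading = doc.DocumentNode.SelectSingleNode("//h2")                         ' first <h2> in the document
Dim externalLinks = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'http')]")  ' absolute URLs only
Dim secondParagraphs = doc.DocumentNode.SelectNodes("//p[2]")                        ' the second <p> under each parent
Dim newsItems = doc.DocumentNode.SelectNodes("//ul[@class='news-list']/li")          ' list items under a hypothetical class

One quirk worth knowing: SelectNodes returns Nothing rather than an empty collection when no elements match, so check the result before iterating.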
Scraping a Dynamic Website with PuppeteerSharp
Many modern websites are dynamic or single-page applications (SPAs) that load content asynchronously using JavaScript. Their raw HTML source doesn't contain the actual data you see in the browser.
To scrape dynamic pages, you need a headless browser that can execute JavaScript and wait for elements to load before extracting them. That's where PuppeteerSharp comes in. It provides a high-level API to control a headless Chrome browser.
Let's test it out by scraping news headlines from a dynamic site like the Wall Street Journal (wsj.com).
Imports PuppeteerSharp

Private Async Sub ScrapeWSJ_Click(sender As Object, e As EventArgs) Handles ScrapeWSJ.Click
    ' Download a compatible Chromium build on first run (a no-op afterwards)
    Await New BrowserFetcher().DownloadAsync()

    ' Launch a headless browser and open a new page
    Dim browser As IBrowser = Await Puppeteer.LaunchAsync(New LaunchOptions With {.Headless = True})
    Dim page As IPage = Await browser.NewPageAsync()

    Dim url As String = "https://www.wsj.com"
    Await page.GoToAsync(url)

    ' Wait until the site's JavaScript has rendered the headline elements
    Await page.WaitForSelectorAsync(".WSJTheme--headline--unZqjb45")

    ' Run JavaScript in the page context to pull the text out of each headline
    Dim headlines As String() = Await page.EvaluateFunctionAsync(Of String())("() => Array.from(document.querySelectorAll('.WSJTheme--headline--unZqjb45'), el => el.innerText)")

    Await browser.CloseAsync()

    ResultsTextBox.Text = String.Join(vbNewLine & vbNewLine, headlines)
End Sub
Walking through this code:
- Download a Chromium build if one isn't cached, then launch a headless browser instance
- Create a new page and navigate to wsj.com
- Wait for the headline elements to load (identified by their specific class name)
- Execute JavaScript in the page context to select the headline elements and extract their text
- Close the browser
- Display the scraped headlines, separated by blank lines
With PuppeteerSharp, the actual web scraping is done by JavaScript code that runs in the browser context (inside the EvaluateFunctionAsync call). This gives you the full power of the browser's JavaScript engine and DOM access to collect the data you need. The VB code simply launches the browser, tells it what page to load and what elements to wait for, then retrieves the result.
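Besides EvaluateFunctionAsync, PuppeteerSharp also exposes EvaluateExpressionAsync for one-off expressions. A quick sketch of both, typed to VB return values (the selectors here are generic, not tied to any particular site):

' Evaluate a bare JavaScript expression and get back a String
Dim pageTitle As String = Await page.EvaluateExpressionAsync(Of String)("document.title")

' Evaluate a JavaScript function and get back an Integer
Dim linkCount As Integer = Await page.EvaluateFunctionAsync(Of Integer)("() => document.querySelectorAll('a').length")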
Note the Async keyword used throughout – most PuppeteerSharp methods are asynchronous and return Task objects. In VB, you have to use Await to wait for those tasks to complete and access their results.
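If you'd rather keep the browser work out of the event handler, one common pattern is to wrap it in an Async Function that returns a Task, then Await it from the handler. Here's a minimal sketch (GetPageTitleAsync is our own helper name, not a PuppeteerSharp API):

Private Async Function GetPageTitleAsync(url As String) As Task(Of String)
    ' Download a compatible Chromium build on first run (a no-op afterwards)
    Await New BrowserFetcher().DownloadAsync()
    Dim browser As IBrowser = Await Puppeteer.LaunchAsync(New LaunchOptions With {.Headless = True})
    Dim page As IPage = Await browser.NewPageAsync()
    Await page.GoToAsync(url)
    ' GetTitleAsync returns the loaded page's document.title
    Dim title As String = Await page.GetTitleAsync()
    Await browser.CloseAsync()
    Return title
End Function

An Async event handler can then simply Await GetPageTitleAsync("https://example.com") and stay responsive while the browser works.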
Best Practices and Considerations
When scraping websites, there are some important things to keep in mind:
- Respect the site's robots.txt file. It specifies which paths automated clients are allowed to access.
- Don't overwhelm the site with requests. Pause between calls (e.g. with Task.Delay or Thread.Sleep) to avoid hammering the server; see the sketch after this list.
- Set the User-Agent header of your web client to something descriptive. Some sites may block unknown agents.
- Gracefully handle any errors or unexpected results. Use Try/Catch blocks and check for empty or Nothing results.
- Data formats, page structures, and site policies change over time. Be prepared to update and refactor your scraping code periodically.
- Consider the legal implications. Although scraping public data is generally allowed, some sites may prohibit it in their terms of service.
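Here is a minimal sketch that combines several of these points, reusing the HtmlAgilityPack import from earlier; the User-Agent string and two-second delay are just example values:

Private Async Function ScrapePolitelyAsync(urls As IEnumerable(Of String)) As Task
    Dim web As New HtmlWeb()
    ' Identify your scraper; some sites block unknown or empty agents
    web.UserAgent = "MyScraper/1.0 (contact@example.com)"
    For Each url As String In urls
        Try
            Dim doc As HtmlDocument = web.Load(url)
            Dim titleNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//h1")
            ' SelectSingleNode returns Nothing when there is no match
            If titleNode IsNot Nothing Then
                Console.WriteLine(titleNode.InnerText.Trim())
            End If
        Catch ex As Exception
            ' Log the failure and move on instead of crashing the whole run
            Console.WriteLine("Failed to scrape " & url & ": " & ex.Message)
        End Try
        ' Pause between requests to avoid hammering the server
        Await Task.Delay(TimeSpan.FromSeconds(2))
    Next
End Function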
Real-World Use Cases
Web scraping has countless applications across many fields. Some examples:
- E-commerce businesses scraping competitor pricing data to inform their own pricing models
- Financial firms collecting historical stock prices, analyst ratings, SEC filings, etc. to drive algorithms
- Data scientists gathering training data for machine learning models
- Academics conducting research by analyzing social media posts or online news articles
- Sports analysts aggregating stats, scores, betting odds, etc. to identify trends
- Non-profits monitoring events, petitions, or public opinion related to their causes
Conclusion
Hopefully this guide gave you a solid introduction to the art of web scraping using Visual Basic and the .NET platform. We covered project setup, static HTML parsing with HtmlAgilityPack, dynamic website scraping with PuppeteerSharp, and some general best practices.
Remember, with great power comes great responsibility. Use your web scraping skills ethically and respect the intellectual property rights of content owners. Happy scraping!