Web scraping has become an essential tool for extracting data from the ever-expanding universe of e-commerce websites. As online sales continue to grow rapidly, access to accurate and up-to-date product information is more critical than ever for market research, price monitoring, and competitive analysis.
E-Commerce Continues to Boom
Global e-commerce sales reached $5.7 trillion in 2022, an increase of 11% over 2021. By 2025, e-commerce is projected to account for nearly 25% of total retail sales worldwide.
| Year | Global E-Commerce Sales (USD trillions) | Growth |
|------|-----------------------------------------|--------|
| 2020 | $4.2 | 25.7% |
| 2021 | $5.2 | 23.8% |
| 2022 | $5.7 | 11.0% |
| 2025 | $8.1 (projected) | – |
Source: Statista Global Retail E-Commerce Sales Report 2023
With millions of products across thousands of sites, programmatically extracting product data at scale is no easy feat. Fortunately, the widespread adoption of schema markup has made it significantly easier in recent years.
The Rise of Schema Markup
Schema.org was launched in 2011 by Google, Bing, and Yahoo as a standardized vocabulary for adding semantic meaning to web pages. By tagging key pieces of information with predefined "types" and "properties", schema helps search engines and other tools better understand the contents of a page.
E-commerce sites in particular have embraced schema markup, as it can enable rich results and increase search visibility. The `Product` schema type allows merchants to unambiguously specify details like price, availability, reviews, and many other product attributes.
According to Web Data Commons, 30% of all websites now include schema.org markup. For e-commerce sites, the adoption is even higher – over 50% of online stores now have schema according to BuiltWith.
[Chart: Schema Adoption Over Time]
Adding schema markup provides a double benefit for merchants – it can enhance appearance in search results, and it makes the product data much easier to scrape!
Parsing Product Schema
Here's an example of product data represented in schema markup using JSON-LD syntax:
```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Apple iPhone 14 Pro",
  "description": "Apple iPhone 14 Pro smartphone with 6.1-inch Super Retina XDR display.",
  "image": "https://example.com/iphone14pro.jpg",
  "sku": "iphone14pro128gb",
  "brand": {
    "@type": "Brand",
    "name": "Apple"
  },
  "offers": {
    "@type": "Offer",
    "price": "999.99",
    "priceCurrency": "USD",
    "availability": "http://schema.org/InStock",
    "seller": {
      "@type": "Organization",
      "name": "Best Buy"
    }
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "reviewCount": "6850"
  }
}
</script>
```
With the product data clearly labeled like this, parsing it out becomes trivial compared to scraping unstructured HTML. Here's a simplified example using Python and BeautifulSoup:
```python
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/product/iphone-14-pro"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Locate the JSON-LD script tag and parse its contents
schema_data = json.loads(soup.find("script", type="application/ld+json").text)

# Pull fields directly out of the parsed Product schema
product_name = schema_data["name"]
product_price = float(schema_data["offers"]["price"])
product_currency = schema_data["offers"]["priceCurrency"]
product_availability = schema_data["offers"]["availability"]
seller_name = schema_data["offers"]["seller"]["name"]
```
Rather than having to use complex CSS or XPath selectors to locate the desired data in the page HTML, we can access the values directly from the JSON-LD schema dictionary using the standard `Product` property names.
Of course, real-world scraping is rarely quite this simple. Here are a few additional factors to consider:
Schema Inconsistency
While having schema present is great, the specific properties used can vary between sites. Some use the `offers.price` property for price, others `offers.lowPrice`/`highPrice`, some publish Microdata rather than JSON-LD, and so on.
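One way to cope with the price variations is a small normalization helper that checks each of the common property names in turn. The `extract_price` function below is a hypothetical sketch that builds on the `schema_data` dictionary from the earlier snippet:

```python
# Hypothetical helper that normalizes the most common Offer price variants.
def extract_price(offers):
    # Some sites publish a list of Offer objects; use the first one if so
    if isinstance(offers, list):
        offers = offers[0]
    # Prefer a plain price, then fall back to AggregateOffer-style lowPrice/highPrice
    for key in ("price", "lowPrice", "highPrice"):
        value = offers.get(key)
        if value not in (None, ""):
            return float(value)
    return None  # no recognizable price property found

product_price = extract_price(schema_data.get("offers", {}))
```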
It's important to analyze a variety of pages to understand the schema landscape for your specific use case. Tools like Google's Rich Results Test and the Schema Markup Validator (successors to the retired Structured Data Testing Tool) can help visualize how schema is implemented across different sites and pages.
JavaScript Rendering
Many modern websites render content dynamically using JavaScript frameworks like React, Angular, etc. This can make scraping trickier, as the product schema may not be present in the initial page HTML.
To handle these cases, you'll need to use a headless browser like Puppeteer or Playwright that can execute JavaScript. These tools can wait for the full page (including schema) to render before extracting it.
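As a rough sketch, Playwright's synchronous API can load the page, wait for the JSON-LD script tag that the client-side framework injects, and hand the raw JSON to the same parsing logic as before (the URL is a placeholder and the wait strategy will vary by site):

```python
import json
from playwright.sync_api import sync_playwright

url = "https://www.example.com/product/iphone-14-pro"  # placeholder URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    # Wait for the JSON-LD block that only appears after client-side rendering
    page.wait_for_selector('script[type="application/ld+json"]', state="attached")
    raw_schema = page.locator('script[type="application/ld+json"]').first.inner_text()
    browser.close()

schema_data = json.loads(raw_schema)
print(schema_data["name"], schema_data["offers"]["price"])
```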
Pagination and Infinite Scroll
Product category pages often span multiple pages or load additional results as the user scrolls. To fully scrape all products, you'll need to handle these pagination techniques – either by constructing URLs for each page, or simulating scroll events to trigger loading.
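Here is a minimal sketch of the URL-construction approach, assuming the category page accepts a `page` query parameter and stops returning products once the listing is exhausted (both assumptions need to be verified per site):

```python
import json
import requests
from bs4 import BeautifulSoup

base_url = "https://www.example.com/category/phones"  # placeholder category URL
products = []

for page_num in range(1, 51):  # hard upper bound as a safety net
    response = requests.get(base_url, params={"page": page_num}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect every Product entry found in the page's JSON-LD blocks
    found_any = False
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            products.append(data)
            found_any = True
    if not found_any:
        break  # assume an empty page means the listing is exhausted

print(f"Collected {len(products)} products")
```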
Anti-Bot Protections
Many large e-commerce sites employ bot detection and mitigation measures to prevent scraping. These can include blocking suspicious IPs, rate limiting, CAPTCHAs, user agent fingerprinting, and more.
Using rotating proxies, spoofing user agents, and adding random delays between requests can help avoid detection. For complex cases, tools like ScrapingBee and ScraperAPI provide proxy rotation and CAPTCHA solving out of the box.
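At the simplest end of that spectrum, rotating a small pool of user-agent strings and sleeping a random interval between requests can be done with plain `requests`; the strings and delay range below are purely illustrative:

```python
import random
import time
import requests

# Illustrative pool of user-agent strings; a real rotation would use many more
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def polite_get(url, session=None):
    session = session or requests.Session()
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, timeout=10)
    # Random delay so the request rate does not look obviously robotic
    time.sleep(random.uniform(2.0, 5.0))
    return response

response = polite_get("https://www.example.com/product/iphone-14-pro")  # placeholder URL
print(response.status_code)
```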
Legal and Ethical Considerations
Is it legal to scrape e-commerce sites? As with most legal questions, it depends. Some key factors:
- Terms of Service – Many sites expressly prohibit scraping in their terms. Violating these terms could be grounds for a civil lawsuit.
- Copyright – Scraping copyrighted content like product descriptions or images and republishing may be considered infringement if done without permission.
- Trespass to Chattels – Excessive or poorly-behaved scraping that harms a website's functionality may be considered a "trespass to chattels" tort.
- Computer Fraud and Abuse Act – In the US, unauthorized access to password-protected sites may violate the CFAA. Exactly what "unauthorized" means is still a gray area legally.
Even if scraping is legal, the ethics are debatable. Best practices to be a "good citizen" scraper include:
- Respect `robots.txt` files (see the sketch after this list)
- Don't scrape faster than needed; limit concurrent requests
- Only collect publicly accessible data
- Don't republish scraped content or personally identifying info
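For the robots.txt point, Python's standard library ships `urllib.robotparser`, which makes the check easy to automate; a minimal sketch (the domain and user-agent string are placeholders):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")  # placeholder domain
robots.read()

url = "https://www.example.com/product/iphone-14-pro"
if robots.can_fetch("my-scraper-bot", url):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt; skip this URL")
```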
For more, see the EFF's Web Scraping Legal Guide.
Scraping Alternatives
While scraping can be powerful, it's not always the best approach to accessing product data. Some alternatives to consider:
APIs
Many large retailers like Amazon, Walmart, and Target offer official APIs for affiliates and partners to access product catalogs and pricing. Using APIs is generally faster and more reliable than scraping.
Turnkey Solutions
For those who prefer to outsource, there are SaaS providers that specialize in collecting and reselling e-commerce data. Examples include DataWeave, Intelligence Node, and PriceSpider.
These services can save significant development time, but may be cost-prohibitive depending on your data needs and budget.
Case Studies
To illustrate the value of web scraping for e-commerce, here are a few examples of companies leveraging scraped product data for different purposes:
Klover – Price Monitoring
Klover is a browser extension that automatically applies coupon codes at checkout. They use web scraping to monitor thousands of retailer sites and maintain a massive database of active discount codes.
According to Andrew Renaut, Klover's CTO:
Web scraping is an essential part of our tech stack at Klover. It allows us to collect discount codes at a scale and efficiency that would be impossible to replicate manually. We lean heavily on open source tools like Scrapy and Playwright, combined with cloud scraping platforms like ScrapingBee for particularly tricky sites. Schema markup has been a huge time saver, as we can extract fields like price and availability without having to reverse engineer each site's HTML structure.
Shein – Competitive Analysis
Shein is one of the world's fastest growing fast fashion brands. They're famous for rapidly replicating trendy styles at low prices.
While Shein has never publicly confirmed it, many industry watchers believe they use web scraping to keep tabs on competitors and identify fashion trends in near real-time. By collecting data on bestselling items, new arrivals, reviews, etc., Shein can quickly decide which styles to produce.
Ahrefs – SEO Research
Ahrefs is a popular SEO tool used by marketers to analyze and monitor search rankings. In addition to crawling search results directly, they use web scraping to collect schema markup from ranking pages.
Access to this structured data allows them to provide reports on how different schema types impact rankings, and identify opportunities for clients to add schema for better SEO. According to their documentation:
Ahrefs collects all JSON-LD and Microdata markup implemented on the pages that we crawl, and uses it to generate SERP Item Markup reports. These reports allow you to see how different types of markup correlate with rankings, and audit your site's schema implementation.
The Future of Web Scraping for E-Commerce
As e-commerce continues to grow and evolve, reliable product data will only become more valuable. I expect to see increased adoption of schema markup (and other structured data formats), making scraping easier.
At the same time, I anticipate large retailers will invest in more sophisticated anti-bot measures as they seek to protect their data as a competitive advantage. Techniques like browser fingerprinting, user behavior analysis, and machine learning will make it increasingly difficult to scrape large sites undetected.
Christian Skugstad, Co-Founder and CTO of Apify, shares his predictions:
In the future, e-commerce scraping will be an arms race between better markup standards and better bot detection. As data quality becomes a bigger competitive differentiator, retailers may try to restrict access to protect their advantages. Scraping will still be doable, but may require more specialized tools and expertise to stay ahead of the anti-bot arms race.
Regardless of the technical complexities, the fundamental value proposition of web scraping for e-commerce is not going away anytime soon. As long as buying and selling products online continues to grow, there will be a need for programmatic access to product data. Those who can reliably collect and leverage it will have a significant leg up over the competition.