Content marketing has become fiercely competitive. Over 4 million blog posts are published every day, while 91% of marketers say they use content marketing in some form.^1 Standing out in this deluge of content and attracting organic traffic requires a strategic approach.
One powerful but often overlooked tactic is supplementing your original content production with curated third-party content extracted from other websites. When done well, this can help you:
- Build a richer content resource for your audience
- Fill gaps in your content calendar
- Identify trending topics and insights for your industry
- Improve SEO with additional internal linking opportunities
- Gain a competitive edge through content and data analysis
Best of all, you don't need coding skills to automatically extract content from websites at scale. Modern no-code web scraping tools and content aggregation platforms have made it easier than ever to integrate third-party content into your marketing mix.
Content Extraction 101: How Web Scrapers and Aggregators Work
Before we dive into specific tools and tactics, let's set the stage with an overview of how content extraction works. There are two main approaches:
Web Scraping
Web scraping refers to automatic extraction of publicly available data from web pages. At a high level, web scraping tools work by:
- Sending an HTTP request to the target page
- Downloading the page's HTML content
- Parsing the HTML to extract the desired data points
- Saving the extracted data in a structured format like CSV or JSON
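The four steps above can be sketched with nothing but the Python standard library. This is an illustrative toy, not a production scraper: the HTML is a canned snippet standing in for steps 1–2 (the HTTP fetch), and the class names are invented for the example.

```python
# Minimal sketch of the parse -> save half of the scraping pipeline.
# A real scraper would first fetch the HTML over HTTP (steps 1-2);
# here we use a canned snippet so the example is self-contained.
from html.parser import HTMLParser
import csv, io

SAMPLE_HTML = """
<article>
  <h2 class="title">How to Curate Content</h2>
  <span class="author">Jane Doe</span>
</article>
"""

class ArticleParser(HTMLParser):
    """Collects the text inside elements whose class is in `targets`."""
    def __init__(self, targets):
        super().__init__()
        self.targets = targets   # e.g. {"title", "author"}
        self.current = None      # class of the element we're inside, if any
        self.data = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.targets:
            self.current = cls

    def handle_data(self, data):
        if self.current and data.strip():
            self.data[self.current] = data.strip()
            self.current = None

# Step 3: parse the HTML to extract the desired data points
parser = ArticleParser({"title", "author"})
parser.feed(SAMPLE_HTML)

# Step 4: save the extracted data in a structured format (CSV)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author"])
writer.writeheader()
writer.writerow(parser.data)
print(buf.getvalue())
```

Visual tools generate the equivalent of the `ArticleParser` selectors for you when you click on page elements.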
Modern scrapers like Octoparse automate this process with a visual interface that lets you select the data points you want with no coding required. Under the hood, they construct and render the appropriate HTTP requests and HTML/CSS selectors to surgically extract your target content.
Here are some of the key features and benefits of web scraping for content marketers:
- Extract full article text, images, meta descriptions, author info, publish dates, and other page elements
- Scrape articles across paginated category and tag archives
- Schedule recurring scrapes to automatically extract newly published content
- Gather social share counts, likes, and comments to gauge content performance
- Compile content from competitors to identify whitespace opportunities
- Leverage rotating proxies and headless browsers to scrape at scale
- Export content in structured formats for analysis in Excel, Google Sheets, or BI tools
Some popular use cases of web scraping for content marketing include building news aggregators, curating niche content hubs, generating AI training datasets, and more.
Content Aggregation
Content aggregation platforms offer another approach to extracting third-party content. Rather than raw scraping, they focus on curating and syndicating content through RSS feeds, APIs, and partnerships with publishers.
Leading aggregators like Flipboard and Pocket have feeds of recommended content personalized to each user. Users can also save or "clip" content from around the web into their library.
Other aggregators focus more on providing curated content feeds to businesses. They offer features like:
- Customized content streams filtered by keywords, sites, and categories
- AI and machine learning to surface the most relevant content
- Licensed content from premium publishers
- Analytics on content engagement and performance
- Embeddable content widgets for your website
- APIs to sync content to your CMS, email marketing tool, etc.
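As a rough sketch of how a keyword-filtered content stream works under the hood, here is a minimal RSS filter using Python's standard library. The feed XML, URLs, and keywords are invented for the example; real platforms layer ranking, licensing, and machine learning on top of this kind of matching.

```python
# Toy keyword filter over an RSS feed, the basic operation behind
# "customized content streams". The feed content below is made up.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><title>SEO Tips for 2024</title><link>https://example.com/seo</link></item>
  <item><title>Office Party Photos</title><link>https://example.com/party</link></item>
  <item><title>Content Marketing Trends</title><link>https://example.com/trends</link></item>
</channel></rss>"""

def filter_feed(feed_xml, keywords):
    """Return (title, link) pairs whose title matches any keyword."""
    root = ET.fromstring(feed_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", "")
        if any(kw.lower() in title.lower() for kw in keywords):
            matches.append((title, item.findtext("link")))
    return matches

stream = filter_feed(SAMPLE_FEED, ["SEO", "content"])
print(stream)  # only the two marketing-related items survive the filter
```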
By outsourcing content discovery and curation to these platforms, marketers can boost content output and engagement without a huge increase in production time and cost.
Choosing Your Weapon: Top Content Extraction Tools
Now that you understand the basics of how web scrapers and content aggregators work, let's take a closer look at some of the top tools. Here's how three popular options stack up:
Octoparse
Octoparse is a powerful web scraping tool designed for non-coders. Its visual point-and-click interface makes it easy to extract text, images, and other page elements.
Key features include:
- Pre-built scrapers for popular sites like Amazon and Twitter
- Ability to scrape infinite scroll and other dynamic page elements
- Workflow designer to combine multiple scraping tasks
- Built-in data cleaning and de-duplication
- Cloud-based scraping and API access
Octoparse offers a free plan with 10K monthly records. Paid plans start at $79/month. It's an excellent all-around choice for marketers new to web scraping.
Parsehub
Parsehub is another powerful visual web scraping tool in the same vein as Octoparse. It boasts a slick desktop app for Mac and Windows.
Key features include:
- Intuitive click-and-extract interface
- Scrape behind logins and handle complex navigation paths
- Built-in proxy rotation and CAPTCHA solving
- API access and webhooks for easy integration
Parsehub's free plan offers 5 public projects. Paid plans start at $149/month. It's a great choice for large-scale scraping jobs.
Feedly
Shifting gears to content aggregation, Feedly is one of the most popular tools for curating feeds of blogs, news, and other publications. Over 15 million users rely on Feedly to stay informed.^3
Key features include:
- Curate content from RSS feeds, Google Alerts, newsletters, and more
- Leo AI assistant for article summaries and highlights
- Collaborate with team members using shared tags and boards
- Integrates with 3000+ apps via Zapier and IFTTT
- Distraction-free reading and bookmarking
Feedly offers a limited free plan. Paid plans with AI features start at $8/month. It's great for content research and competitive monitoring.
Web Scraping Best Practices for Content Marketers
Whichever tool you choose, there are some important guidelines to keep in mind when scraping content from other websites:
1. Respect Robots.txt
The robots.txt file specifies which pages and directories a website allows crawlers to access. Respecting these rules is not only ethical but also helps you avoid IP blocks. Tools like Octoparse follow robots.txt conventions automatically.
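If you ever need to check these rules yourself, Python ships a robots.txt parser in `urllib.robotparser`. In this sketch the rules are supplied inline rather than fetched over the network, and the bot name is a placeholder:

```python
# Checking paths against robots.txt rules with Python's built-in parser.
# A real crawler would fetch https://example.com/robots.txt; the rules
# here are canned so the sketch runs without network access.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /blog/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

allowed = rp.can_fetch("MyContentBot/1.0", "https://example.com/blog/post-1")
blocked = rp.can_fetch("MyContentBot/1.0", "https://example.com/private/data")
print(allowed, blocked)  # True False
```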
2. Throttle Your Crawl Rate
Sending too many requests to a website too quickly can overload their servers and get you banned. A good rule of thumb is waiting at least 10-15 seconds between requests and avoiding scraping during peak traffic times.
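A simple politeness throttle along these lines might look like the following sketch. The delay is a parameter so the example runs quickly; for real scraping you would set it to the 10–15 seconds suggested above.

```python
# Minimal request throttle: enforce a minimum gap between requests.
import time

class Throttle:
    """Enforces a minimum delay between successive requests."""
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to honor the configured delay.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()

throttle = Throttle(delay_seconds=0.2)  # use 10-15 s for real scraping
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # the HTTP request for each page would go here
elapsed = time.monotonic() - start
print(f"3 throttled requests took at least {elapsed:.1f}s")
```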
3. Don't Disguise Your Scraper
Using a transparent user agent string helps site owners understand how you're using their content and keeps your scraper from looking like a malicious bot. Include a custom user agent with your company name and contact info.
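With Python's standard library, attaching an identifying user agent is a one-liner. The bot name and contact details below are placeholders you would replace with your own:

```python
# Building a request with an honest, identifying User-Agent header.
# No network call is made; we only construct and inspect the request.
from urllib.request import Request

USER_AGENT = "AcmeContentBot/1.0 (+https://acme.example.com/bot; marketing@acme.example.com)"

req = Request(
    "https://example.com/blog/post-1",
    headers={"User-Agent": USER_AGENT},
)
# urllib normalizes header names to capitalized form ("User-agent")
print(req.get_header("User-agent"))
```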
4. Cache and Proxy
Caching scraped content prevents unnecessary re-scraping that can get you rate limited. And using proxy IPs allows you to distribute requests across different locations to minimize individual server load.
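A minimal in-memory cache illustrates the first half of this tip: pages already fetched are served locally instead of hitting the server again. The fetch function is faked here so the sketch runs without any network access.

```python
# Toy URL cache: a page is fetched at most once, then served from memory.
def cached_fetch(url, cache, fetch):
    """Return cached content for url, fetching (and caching) on a miss."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]

calls = []  # record every "network" request the fake fetcher receives
def fake_fetch(url):
    calls.append(url)
    return f"<html>content of {url}</html>"

cache = {}
cached_fetch("https://example.com/a", cache, fake_fetch)
cached_fetch("https://example.com/a", cache, fake_fetch)  # served from cache
print(len(calls))  # the "server" was hit only once
```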
5. Aggregate, Don‘t Plagiarize
Directly republishing scraped content without substantive modification could raise duplicate content and copyright infringement issues. Instead, use scraped content as one input in a broader curation and editorialization process. Always cite and link to the original source.
The Future of Content Extraction
Going forward, I believe several key developments will make content extraction even more powerful for marketers:
Structured Data – More websites are adopting structured data formats like Schema.org to help search engines understand their content. This will also make it easier for scrapers to reliably extract key content attributes.
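To see why structured data helps, here is a sketch that pulls an article's Schema.org JSON-LD block out of a page using only the standard library. The page snippet and field values are invented, and a production parser would be more robust than this simple regex, but it shows how structured markup replaces fragile CSS selectors with named fields:

```python
# Extracting Schema.org metadata from a page's JSON-LD script block.
# The HTML snippet is canned; a real scraper would fetch it first.
import json, re

PAGE = """<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Curating Content at Scale",
 "author": {"@type": "Person", "name": "Jane Doe"},
 "datePublished": "2024-01-15"}
</script>
</head><body>...</body></html>"""

match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', PAGE, re.DOTALL
)
article = json.loads(match.group(1))
print(article["headline"], "-", article["author"]["name"])
```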
Full-Content RSS – While RSS isn't as popular as it once was, I expect renewed interest in providing full-content RSS feeds as a way for publishers to reach more readers and monetize their back catalog. Specialized content aggregators will be the main beneficiaries.
AI-Generated Content – As generative language models like GPT-3 improve, more websites will publish auto-generated content at massive scale. Scrapers and aggregators with strong content filtering and verification systems will become essential for separating signal from noise.
Content Licensing – Startups like Unlock are developing blockchain-based content licensing and micropayment protocols.^4 These could evolve to support more granular and decentralized licensing of scraped or aggregated content, with royalties automatically paid to the creators.
By staying on top of these trends and leveraging the power of automated content extraction, marketers can gain a significant edge in the battle for audience attention. But as content becomes more accessible and machine-generated, the bar for quality will only get higher.
The key is to use scraped content as a starting point for your own original research, analysis, and creative work. By curating the most relevant insights for your audience and adding your own unique perspective, you'll be well positioned to win the content game.