How to Scrape Real Estate Data from Idealista Using Python and Selenium (2024 Guide)
Idealista is the leading real estate portal in Spain, Italy, and Portugal, listing millions of properties for sale and rent. It's a gold mine of data for real estate investors, analysts, and businesses looking to gain insights into property markets in these regions.
In this in-depth guide, we'll walk through how to programmatically scrape a wealth of data from Idealista using Python and Selenium. By the end, you'll have a fully functional web scraper, and we'll have covered:
- The list of provinces and municipalities across Spain
- Property listings in each location, with pagination
- Key details for each listing, including title, price, description, size, and URL
- Strategies for evading bot detection to scrape at scale
Whether you're a data scientist, real estate professional, or just a programming enthusiast, read on to learn the ins and outs of scraping Idealista! All the code is available on GitHub to follow along.
Prerequisites
To build our Idealista scraper, we'll be using the following tools and libraries:
- Python 3.7+
- Selenium
- undetected-chromedriver
You can install the required Python packages with:
pip install selenium undetected-chromedriver
Selenium is a powerful framework for automating web browsers, which we'll leverage to programmatically interact with Idealista. The undetected-chromedriver library provides a drop-in replacement for Selenium's default ChromeDriver, with extra functionality to evade common bot detection techniques.
With the setup out of the way, let's start building our scraper!
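Here's a minimal setup sketch, assuming the standard undetected-chromedriver API (the import alias and the default, option-free launch are just one reasonable choice):

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

# Launch a Chrome instance patched to evade common automation fingerprints
driver = uc.Chrome()

We'll pass this driver object into every scraper function that follows.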
Overview of the Scraping Process
At a high level, our scraper will follow this process:
- Launch a browser instance and navigate to the Idealista Spain homepage
- Extract the list of province links
- For each province, navigate to its dedicated page and extract the list of municipality links
- For each municipality, navigate to its property listings page
- Extract key details for each listing (title, price, description, etc.), navigating through pagination
- Compile the scraped data into a structured format and save it to disk
We'll modularize each of these steps into separate functions for cleaner, more maintainable code.
Extracting the List of Provinces
Our first task is to extract the list of provinces in Spain that Idealista has property listings for.
Inspecting the Idealista homepage, we can see the province list is contained in a div with the class locations-list.
Each province is a link, so we'll extract the name and URL for each one. Here's the function to do that:
def get_provinces(driver):
    # Load the Idealista Spain homepage
    driver.get("https://www.idealista.com")

    # The province list lives in a div with the class "locations-list"
    province_div = driver.find_element(By.CLASS_NAME, 'locations-list')
    province_links = province_div.find_elements(By.TAG_NAME, 'a')

    provinces = []
    for link in province_links:
        provinces.append({
            'name': link.text,
            'url': link.get_attribute('href')
        })
    return provinces
The function takes the driver instance we created during setup and navigates to the Idealista homepage. Then it locates the div containing the province list and extracts each link element.
For each link, we extract its anchor text (the province name) and URL, appending it to the provinces list that gets returned.
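One caveat: if the homepage renders the province list asynchronously, this lookup can fire before the element exists. A hedged variant using Selenium's explicit waits (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the province list appears in the DOM
province_div = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'locations-list'))
)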
Extracting Municipalities for Each Province
With the province list in hand, the next step is to extract the municipalities belonging to each province.
Navigating to a province page, we can see the municipality list has a similar structure, contained in a list element with the ID location_list.
We'll follow the same process as before, extracting the name and URL for each municipality link:
def get_municipalities(driver, province):
    # Load the province's dedicated page
    driver.get(province['url'])

    # The municipality list is the element with the ID "location_list"
    muni_list = driver.find_element(By.ID, 'location_list')
    muni_links = muni_list.find_elements(By.TAG_NAME, 'a')

    municipalities = []
    for link in muni_links:
        municipalities.append({
            'name': link.text,
            'url': link.get_attribute('href')
        })
    return municipalities
The function takes the province dictionary and navigates the browser to its URL. It then extracts the list of municipality links, returning a list of dictionaries with the name and URL for each one.
We can tie the province and municipality extraction together like this:
provinces = get_provinces(driver)

for province in provinces:
    municipalities = get_municipalities(driver, province)
    province['municipalities'] = municipalities
This nests the municipality data inside each corresponding province dictionary.
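To make the shape concrete, each province dictionary now looks roughly like this (the values here are purely illustrative, not real scraped data):

province = {
    'name': 'Example Province',
    'url': 'https://www.idealista.com/...',
    'municipalities': [
        {'name': 'Example Municipality', 'url': 'https://www.idealista.com/...'},
        # ... one dict per municipality
    ]
}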
Extracting Property Listings
Now it's time for the meat of the scraper: extracting the actual property listings from each municipality.
Analyzing a municipality listings page, we can see each property is an article element whose class contains item.
To handle pagination, we'll use a recursive function. It will extract the listings on the current page, check if there's a "next page" link, and if so, call itself with the next page URL to extract those listings too.
Here's the code:
def get_listings(driver, url):
    driver.get(url)
    listings = []

    # Each listing is an <article> element whose class contains "item"
    listing_elements = driver.find_elements(By.XPATH, '//article[contains(@class, "item")]')
    for listing in listing_elements:
        listing_data = {
            'title': listing.find_element(By.XPATH, './/a[@class="item-link"]').text,
            'price': listing.find_element(By.XPATH, './/span[@class="item-price"]').text,
            'description': listing.find_element(By.XPATH, './/div[@class="item-description description"]').text,
            'size': listing.find_element(By.XPATH, './/div[@class="item-detail-char"]').text,
            'url': listing.find_element(By.XPATH, './/a[@class="item-link"]').get_attribute('href')
        }
        listings.append(listing_data)

    # If a "next page" arrow exists, recurse into the next results page
    next_page_link = driver.find_elements(By.XPATH, '//a[@class="icon-arrow-right-after"]')
    if next_page_link:
        next_page_url = next_page_link[0].get_attribute('href')
        listings += get_listings(driver, next_page_url)

    return listings
We use Selenium's find_elements method to grab all the article elements on the page, then pull the title, price, description, size, and URL out of each one with relative XPath queries.
If there's a "next page" link (identified by the icon-arrow-right-after class), the function extracts its URL and recursively calls itself to get the listings from the next page.
All the scraped listings get returned as a list of dictionaries.
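One caveat about the recursive approach: Python caps the call stack at roughly 1,000 frames by default, so a location with extremely deep pagination could, in theory, exhaust it. If that worries you, the same logic translates directly into a loop. A sketch, extracting just title and URL for brevity:

def get_listings_iterative(driver, url):
    listings = []
    next_url = url
    while next_url:
        driver.get(next_url)
        # Same per-listing extraction as get_listings
        for listing in driver.find_elements(By.XPATH, '//article[contains(@class, "item")]'):
            listings.append({
                'title': listing.find_element(By.XPATH, './/a[@class="item-link"]').text,
                'url': listing.find_element(By.XPATH, './/a[@class="item-link"]').get_attribute('href'),
            })
        # Follow the "next page" arrow until it disappears
        next_link = driver.find_elements(By.XPATH, '//a[@class="icon-arrow-right-after"]')
        next_url = next_link[0].get_attribute('href') if next_link else None
    return listings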
To tie it all together:
for province in provinces:
    for muni in province['municipalities']:
        listings = get_listings(driver, muni['url'])
        muni['listings'] = listings
This extracts the listings for each municipality and nests that data in the municipality dictionary.
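The final step from our overview is persisting the data. Since everything is nested in plain dictionaries and lists, JSON is a natural fit. A minimal sketch (the filename is arbitrary):

import json

with open('idealista_data.json', 'w', encoding='utf-8') as f:
    json.dump(provinces, f, ensure_ascii=False, indent=2)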
Avoiding Detection and Next Steps
As you may have noticed, we didn't implement any specific anti-detection measures in the code beyond using undetected-chromedriver.
In practice, a large-scale scraper would almost certainly encounter CAPTCHAs, IP blocking, and other countermeasures. Some ways to make your scraper more robust:
- Incorporate headless browsing
- Introduce random delays between requests (see the sketch after this list)
- Utilize a CAPTCHA solving service
- Use a premium proxy service like ScrapingBee to rotate IP addresses
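As a concrete example of the random-delay idea, here's a small helper you could route all page loads through (the delay bounds are arbitrary and worth tuning against how aggressively the site rate-limits):

import random
import time

def polite_get(driver, url, min_delay=2.0, max_delay=6.0):
    # Sleep a random, human-ish interval before each navigation
    time.sleep(random.uniform(min_delay, max_delay))
    driver.get(url)

Swapping the direct driver.get(...) calls in the functions above for polite_get(...) slows the scraper down, but makes its traffic pattern far less robotic.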
With a fully fledged scraper extracting property data, the next step is to do something useful with it! Some ideas:
- Visualize real estate market trends and prices across different locations
- Compare prices and sizes to identify investment opportunities
- Feed listing data into machine learning models to predict future prices
The raw ingredients provided by this Idealista scraper open up a world of possibilities for real estate analysis and applications. I encourage you to adapt the code to your specific needs and see what other creative uses you can come up with.
As always, respect website terms of service, don't overwhelm servers with requests, and use scraped data ethically.
Happy scraping!