How to Scrape Real Estate Data from Idealista Using Python and Selenium (2024 Guide)
Idealista is the leading real estate portal in Spain, Italy, and Portugal, listing millions of properties for sale and rent. It's a gold mine of data for real estate investors, analysts, and businesses looking to gain insights into property markets in these regions.
In this in-depth guide, we'll walk through how to programmatically scrape a wealth of data from Idealista using Python and Selenium. By the end, you'll have a fully functional web scraper, and we'll have covered:
- The list of provinces and municipalities across Spain
- Property listings in each location, with pagination
- Key details for each listing, including title, price, description, size, and URL
- Strategies for evading bot detection to scrape at scale
Whether you're a data scientist, real estate professional, or just a programming enthusiast, read on to learn the ins and outs of scraping Idealista! All the code is available on GitHub to follow along.
Prerequisites
To build our Idealista scraper, we'll be using the following tools and libraries:
- Python 3.7+
- Selenium
- undetected-chromedriver
You can install the required Python packages with:
pip install selenium undetected-chromedriver
Selenium is a powerful framework for automating web browsers, which we'll leverage to programmatically interact with Idealista. The undetected-chromedriver library provides a drop-in replacement for Selenium's default ChromeDriver, with extra functionality to evade common bot detection techniques.
With the setup out of the way, let's start building our scraper!
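Here's a minimal setup sketch, assuming the standard undetected-chromedriver API (the import alias and the default, option-free launch are just one reasonable choice):

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

# Launch a Chrome instance patched to evade common automation fingerprints
driver = uc.Chrome()

We'll pass this driver object into every scraper function that follows.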
Overview of the Scraping Process
At a high level, our scraper will follow this process:
- Launch a browser instance and navigate to the Idealista Spain homepage
- Extract the list of province links
- For each province, navigate to its dedicated page and extract the list of municipality links
- For each municipality, navigate to its property listings page
- Extract key details for each listing (title, price, description, etc.), navigating through pagination
- Compile the scraped data into a structured format and save it to disk
We'll modularize each of these steps into separate functions for cleaner, more maintainable code.
Extracting the List of Provinces
Our first task is to extract the list of provinces in Spain that Idealista has property listings for.
Inspecting the Idealista homepage, we can see the province list is contained in a div with the class locations-list.
Each province is a link, so we'll extract the name and URL for each one. Here's the function to do that:
def get_provinces(driver):
    # Load the Idealista Spain homepage
    driver.get("https://www.idealista.com")

    # The province list lives in a div with the class "locations-list"
    province_div = driver.find_element(By.CLASS_NAME, 'locations-list')
    province_links = province_div.find_elements(By.TAG_NAME, 'a')

    provinces = []
    for link in province_links:
        provinces.append({
            'name': link.text,
            'url': link.get_attribute('href')
        })
    return provinces
The function takes the driver instance we created during setup and navigates to the Idealista homepage. Then it locates the div containing the province list and extracts each link element.
For each link, we extract its anchor text (the province name) and URL, appending it to the provinces list that gets returned.
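One caveat: if the homepage renders the province list asynchronously, this lookup can fire before the element exists. A hedged variant using Selenium's explicit waits (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the province list appears in the DOM
province_div = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'locations-list'))
)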
Extracting Municipalities for Each Province
With the province list in hand, the next step is to extract the municipalities belonging to each province.
Navigating to a province page, we can see the municipality list has a similar structure, contained in a list element with the ID location_list.
We'll follow the same process as before, extracting the name and URL for each municipality link:
def get_municipalities(driver, province):
    # Load the province's dedicated page
    driver.get(province['url'])

    # The municipality list is the element with the ID "location_list"
    muni_list = driver.find_element(By.ID, 'location_list')
    muni_links = muni_list.find_elements(By.TAG_NAME, 'a')

    municipalities = []
    for link in muni_links:
        municipalities.append({
            'name': link.text,
            'url': link.get_attribute('href')
        })
    return municipalities
The function takes the province dictionary and navigates the browser to its URL. It then extracts the list of municipality links, returning a list of dictionaries with the name and URL for each one.
We can tie the province and municipality extraction together like this:
provinces = get_provinces(driver)

for province in provinces:
    municipalities = get_municipalities(driver, province)
    province['municipalities'] = municipalities
This nests the municipality data inside each corresponding province dictionary.
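To make the shape concrete, each province dictionary now looks roughly like this (the values here are purely illustrative, not real scraped data):

province = {
    'name': 'Example Province',
    'url': 'https://www.idealista.com/...',
    'municipalities': [
        {'name': 'Example Municipality', 'url': 'https://www.idealista.com/...'},
        # ... one dict per municipality
    ]
}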
Extracting Property Listings
Now it's time for the meat of the scraper: extracting the actual property listings from each municipality.
Analyzing a municipality listings page, we can see each property is an article element whose class contains item.
To handle pagination, we'll use a recursive function. It will extract the listings on the current page, check if there's a "next page" link, and if so, call itself with the next page URL to extract those listings too.
Here's the code:
def get_listings(driver, url):
    driver.get(url)
    listings = []

    # Each listing is an <article> element whose class contains "item"
    listing_elements = driver.find_elements(By.XPATH, '//article[contains(@class, "item")]')
    for listing in listing_elements:
        listing_data = {
            'title': listing.find_element(By.XPATH, './/a[@class="item-link"]').text,
            'price': listing.find_element(By.XPATH, './/span[@class="item-price"]').text,
            'description': listing.find_element(By.XPATH, './/div[@class="item-description description"]').text,
            'size': listing.find_element(By.XPATH, './/div[@class="item-detail-char"]').text,
            'url': listing.find_element(By.XPATH, './/a[@class="item-link"]').get_attribute('href')
        }
        listings.append(listing_data)

    # If a "next page" arrow exists, recurse into the next results page
    next_page_link = driver.find_elements(By.XPATH, '//a[@class="icon-arrow-right-after"]')
    if next_page_link:
        next_page_url = next_page_link[0].get_attribute('href')
        listings += get_listings(driver, next_page_url)

    return listings
We use Selenium's find_elements method to grab all the article elements on the page, then pull the title, price, description, size, and URL out of each one with relative XPath queries.
If there's a "next page" link (identified by the icon-arrow-right-after class), the function extracts its URL and recursively calls itself to get the listings from the next page.
All the scraped listings get returned as a list of dictionaries.
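One caveat about the recursive approach: Python caps the call stack at roughly 1,000 frames by default, so a location with extremely deep pagination could, in theory, exhaust it. If that worries you, the same logic translates directly into a loop. A sketch, extracting just title and URL for brevity:

def get_listings_iterative(driver, url):
    listings = []
    next_url = url
    while next_url:
        driver.get(next_url)
        # Same per-listing extraction as get_listings
        for listing in driver.find_elements(By.XPATH, '//article[contains(@class, "item")]'):
            listings.append({
                'title': listing.find_element(By.XPATH, './/a[@class="item-link"]').text,
                'url': listing.find_element(By.XPATH, './/a[@class="item-link"]').get_attribute('href'),
            })
        # Follow the "next page" arrow until it disappears
        next_link = driver.find_elements(By.XPATH, '//a[@class="icon-arrow-right-after"]')
        next_url = next_link[0].get_attribute('href') if next_link else None
    return listings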
To tie it all together:
for province in provinces:
    for muni in province['municipalities']:
        listings = get_listings(driver, muni['url'])
        muni['listings'] = listings
This extracts the listings for each municipality and nests that data in the municipality dictionary.
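The final step from our overview is persisting the data. Since everything is nested in plain dictionaries and lists, JSON is a natural fit. A minimal sketch (the filename is arbitrary):

import json

with open('idealista_data.json', 'w', encoding='utf-8') as f:
    json.dump(provinces, f, ensure_ascii=False, indent=2)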
Avoiding Detection and Next Steps
As you may have noticed, we didn't implement any specific anti-detection measures in the code beyond using undetected-chromedriver.
In practice, a large-scale scraper would almost certainly encounter CAPTCHAs, IP blocking, and other countermeasures. Some ways to make your scraper more robust:
- Incorporate headless browsing
- Introduce random delays between requests (see the sketch after this list)
- Utilize a CAPTCHA solving service
- Use a premium proxy service like ScrapingBee to rotate IP addresses
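As a concrete example of the random-delay idea, here's a small helper you could route all page loads through (the delay bounds are arbitrary and worth tuning against how aggressively the site rate-limits):

import random
import time

def polite_get(driver, url, min_delay=2.0, max_delay=6.0):
    # Sleep a random, human-ish interval before each navigation
    time.sleep(random.uniform(min_delay, max_delay))
    driver.get(url)

Swapping the direct driver.get(...) calls in the functions above for polite_get(...) slows the scraper down, but makes its traffic pattern far less robotic.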
With a fully fledged scraper extracting property data, the next step is to do something useful with it! Some ideas:
- Visualize real estate market trends and prices across different locations
- Compare prices and sizes to identify investment opportunities
- Feed listing data into machine learning models to predict future prices
The raw ingredients provided by this Idealista scraper open up a world of possibilities for real estate analysis and applications. I encourage you to adapt the code to your specific needs and see what other creative uses you can come up with.
As always, respect website terms of service, don't overwhelm servers with requests, and use scraped data ethically.
Happy scraping!