Web scraping often requires logging into websites in order to access pages and data only available to authenticated users. However, programmatically logging into websites can be challenging. You need to handle form submissions, manage cookies and authentication tokens, and deal with CAPTCHAs and other anti-bot measures.
Fortunately, the ScrapingBee web scraping API service provides some handy features to simplify logging into websites when scraping with Node.js. In this tutorial, we'll walk through three different methods to log in to websites using ScrapingBee and Node.js:
- Automating logins with a JavaScript scenario
- Logging in via a direct POST request
- Authenticating by attaching cookies
We'll provide detailed, up-to-date code examples for each approach. To follow along, you'll need a ScrapingBee API key.
Method 1: JavaScript Login Scenario
With this method, we instruct ScrapingBee to run a series of actions to automate interacting with a website's login form, just like a human user would. Here's an example using the official scrapingbee Node.js SDK:
const scrapingbee = require('scrapingbee');
const fs = require('fs');

const client = new scrapingbee.ScrapingBeeClient('YOUR_API_KEY');

// The sequence of actions ScrapingBee's headless browser runs on the page
const loginScenario = {
  instructions: [
    {fill: ["#username", "your-username"]},
    {fill: ["input[name='password']", "your-password"]},
    {click: "button[type='submit']"},
    {wait: 1000}
  ]
};

client.get({
  url: 'https://example.com/login',
  params: {
    js_scenario: loginScenario,
    screenshot: 'true'
  }
})
  .then((response) => fs.writeFileSync('after-login.png', response.data))
  .catch(console.error);
This tells ScrapingBee to:
- Navigate to the login page
- Fill in the username and password fields
- Click the submit button
- Wait 1 second for the page to load
- Take a screenshot to confirm we're logged in
The CSS selectors used to target the form fields and button may need to be adapted for the specific site you're logging into.
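For example, if a site uses dynamic IDs, a sketch like the following may be more robust. The selectors here are hypothetical placeholders, and it swaps the fixed wait for ScrapingBee's wait_for instruction, which pauses until a given selector appears on the page:
// Hypothetical selectors - inspect the target's login form for the real ones
const loginScenario = {
  instructions: [
    {fill: ["input[name='email']", "your-username"]},
    {fill: ["input[name='password']", "your-password"]},
    {click: "form button[type='submit']"},
    // Wait for an element that only exists once you're logged in
    {wait_for: "#account-menu"}
  ]
};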
Pros:
- Fully automates the login process
- Mimics normal user behavior, less likely to be detected as a bot
Cons:
- May not work if the login form is complex or uses dynamic IDs
- Slow, as it loads pages in a full browser environment
Method 2: Direct Login POST Request
If you inspect the network activity while logging in, you'll see that the login form ultimately submits a POST request with the username and password to a specific URL. We can replicate that POST request to log in more directly.
Here's how to implement this using the node-fetch library. Note that ScrapingBee expects its own parameters (such as api_key and url) in the query string, and forwards the method and body of your request on to the target page:
const fetch = require('node-fetch');

// ScrapingBee's own parameters go in the query string; the request
// method and body are forwarded to the target URL.
const apiParams = new URLSearchParams({
  api_key: 'YOUR_API_KEY',
  url: 'https://example.com/login'
});

fetch(`https://app.scrapingbee.com/api/v1/?${apiParams}`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  // The login credentials, form-encoded as a browser would submit them
  body: new URLSearchParams({
    username: 'your-username',
    password: 'your-password'
  })
})
  .then(response => response.text())
  .then(body => console.log(body))
  .catch(console.error);
We send a POST request to the ScrapingBee API with our API key and the login URL in the query string; ScrapingBee then replays our POST, including the form-encoded login credentials, against the login page.
The HTTP response will include any cookies set, which you could extract and pass to subsequent requests to stay logged in.
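As a rough sketch, here's one way you might capture those cookies with node-fetch. Whether the target site's Set-Cookie headers are passed through on the ScrapingBee response depends on the API's header-forwarding behavior, so treat that as an assumption to verify against the docs:
const fetch = require('node-fetch');

const apiParams = new URLSearchParams({
  api_key: 'YOUR_API_KEY',
  url: 'https://example.com/login'
});

fetch(`https://app.scrapingbee.com/api/v1/?${apiParams}`, {
  method: 'POST',
  headers: {'Content-Type': 'application/x-www-form-urlencoded'},
  body: new URLSearchParams({username: 'your-username', password: 'your-password'})
})
  .then(response => {
    // node-fetch exposes repeated headers (like Set-Cookie) via headers.raw().
    // Assumption: the target's Set-Cookie headers survive on this response.
    const setCookies = response.headers.raw()['set-cookie'] || [];
    // Keep just the name=value pairs, ready to send back as a Cookie header
    const cookieHeader = setCookies.map(c => c.split(';')[0]).join('; ');
    console.log('Cookies to reuse:', cookieHeader);
  })
  .catch(console.error);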
Pros:
- Faster, no need to load pages or run JavaScript
- More reliable for simple login forms
Cons:
- Requires figuring out the exact POST parameters to send
- Easier for sites to flag as bot traffic, so riskier
Method 3: Login with Cookies
If you can get hold of the session cookie a site sets when you log in, you can attach it to your scraping requests to authenticate them.
First, log into the target site manually with your browser. Then use the developer tools to find and copy the value of the session cookie. It will likely have a name like sessid or auth_token.
Here's how to use that cookie value with ScrapingBee to access a page requiring authentication:
const scrapingbee = require('scrapingbee');

const client = new scrapingbee.ScrapingBeeClient('YOUR_API_KEY');

client.get({
  url: 'https://example.com/private',
  // Cookies are passed as name/value pairs and attached to the request
  cookies: {session_id: 'your-session-cookie-value'}
})
  .then(response => console.log(response.data))
  .catch(console.error);
Pros:
- Very simple, no need to fiddle with complex login sequences
- Efficient, just attach cookie to requests as needed
Cons:
- Requires manually logging in and extracting the session cookie
- Session cookies can expire, requiring you to repeat the process
Tips for Handling Login Challenges
Some websites employ additional login challenges to prevent bots and other abuse. Here are some common ones and how to deal with them:
- CAPTCHAs – ScrapingBee has built-in support for solving reCAPTCHA v2 and v3 challenges. Enable it by adding solve_recaptcha: true to your requests.
- Two-Factor Authentication – If you have access to the email/phone number, you can manually retrieve the code and provide it as part of a JavaScript scenario. ScrapingBee also supports automating 2FA with its 2captcha integration.
- Rare User Agents/Devices – ScrapingBee allows you to specify a custom User-Agent header and even a specific device to emulate, via the user_agent and device parameters (see the sketch after this list).
- JavaScript Rendering – Some sites require JavaScript to be enabled to work properly. Make sure JavaScript rendering (the render_js parameter, enabled by default) is on in your ScrapingBee requests.
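As a quick illustration, here's how such parameters are passed with the Node SDK. device and render_js are documented ScrapingBee parameters; treat the exact names of the CAPTCHA-, 2FA-, and User-Agent-related options above as things to verify against the current docs:
const scrapingbee = require('scrapingbee');

const client = new scrapingbee.ScrapingBeeClient('YOUR_API_KEY');

client.get({
  url: 'https://example.com/login',
  params: {
    device: 'desktop',  // or 'mobile'
    render_js: 'true'   // JS rendering; on by default, shown for clarity
  }
})
  .then(response => console.log(response.data))
  .catch(console.error);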
Choosing a Login Method
Which login approach to use depends on the complexity of the site you're logging into. In general, start by trying the direct POST method, as it's the simplest and most efficient.
If the login requires executing JavaScript or has additional fields and interactions, try the automated JavaScript scenario approach.
Use the cookies approach for simple authentication if you‘re able to easily grab a session cookie manually.
Mix and match these techniques as needed. For example, you could log in with a JavaScript scenario once to get a session cookie, then attach that cookie to future requests.
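Here's a sketch of that combined flow using the Node SDK. The cookie name session_id is hypothetical, and it assumes the target's Set-Cookie header is visible on the ScrapingBee response, which you should verify against the API docs:
const scrapingbee = require('scrapingbee');

const client = new scrapingbee.ScrapingBeeClient('YOUR_API_KEY');

async function scrapePrivatePage() {
  // Step 1: log in once with a JavaScript scenario
  const login = await client.get({
    url: 'https://example.com/login',
    params: {
      js_scenario: {
        instructions: [
          {fill: ["#username", "your-username"]},
          {fill: ["input[name='password']", "your-password"]},
          {click: "button[type='submit']"},
          {wait: 1000}
        ]
      }
    }
  });

  // Step 2: pull the (hypothetical) session cookie out of the response headers.
  // Assumption: the target's Set-Cookie header is passed through by ScrapingBee.
  const setCookie = String(login.headers['set-cookie'] || '');
  const match = /session_id=([^;]+)/.exec(setCookie);
  if (!match) throw new Error('No session cookie found - check the cookie name');

  // Step 3: attach the cookie to follow-up requests
  const page = await client.get({
    url: 'https://example.com/private',
    cookies: {session_id: match[1]}
  });
  console.log(page.data);
}

scrapePrivatePage().catch(console.error);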
With the flexibility of ScrapingBee and Node.js, you can handle logging in to most websites to scrape data behind authentication.
Happy scraping!