As ride-sharing services like Uber have exploded in popularity, questions have arisen about the fairness and transparency of their pricing models. One persistent concern is whether these platforms leverage user data to charge higher prices in more affluent areas, where riders may be willing to pay a premium for convenience.
As a web scraping and data science professional, I saw an opportunity to investigate this issue through the lens of data. By gathering granular information on Uber price estimates and neighborhood characteristics, I hoped to determine whether the company's algorithms systematically overcharge riders in upscale locales.
In this article, I'll walk through my process for collecting and analyzing the relevant data, share key findings and visualizations, and discuss the broader implications for algorithmic accountability in the ride-sharing industry. While this analysis focuses on Uber, the techniques and principles discussed here could be applied to other platforms and markets where dynamic pricing is prevalent.
Data Collection: Web Scraping Uber Prices and Property Values
To assess whether Uber prices correlate with neighborhood affluence, I needed two key datasets:
- Estimated Uber fares from various locations to a common destination
- Property values in those same neighborhoods to serve as a proxy for affluence
For the Uber price data, I decided to focus on rides to Seattle-Tacoma International Airport (SEA). Airport trips provided a good benchmark since they tend to be longer rides from diverse parts of the city. To generate a sample of representative locations, I scraped 6,000 recently sold home addresses from the real-estate listing site Trulia. The web scraping process involved several steps (a rough code sketch follows the list):
- Use Puppeteer, a Node.js library for controlling a headless Chrome browser, to navigate to Trulia's sold homes listings for Seattle
- Scroll through the listings, allowing the page to lazy-load additional results
- Extract key data points from each listing, including the full address and final sale price
- Save the extracted data to a JSON file for further processing
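The scraper itself was written with Puppeteer in Node.js. For readers who want to stay in Python, a roughly equivalent flow can be sketched with Playwright's sync API; note that this is a swapped-in tool rather than the one used here, and the listings URL and CSS selectors below are placeholders that would need to match Trulia's actual markup.
# Rough Python analogue of the Puppeteer scraper, using Playwright's sync API.
# The URL and the data-testid selectors are placeholders; Trulia's real markup
# differs and changes over time, so treat this as a sketch of the flow only.
import json
import time
from playwright.sync_api import sync_playwright
listings = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.trulia.com/sold/Seattle,WA/")  # hypothetical listings URL
    # Scroll several times so the page lazy-loads additional results
    for _ in range(10):
        page.mouse.wheel(0, 5000)
        time.sleep(1)
    # Extract the address and final sale price from each listing card
    for card in page.query_selector_all("[data-testid='property-card']"):
        address = card.query_selector("[data-testid='property-address']")
        price = card.query_selector("[data-testid='property-price']")
        if address and price:
            listings.append({"address": address.inner_text(),
                             "sale_price": price.inner_text()})
    browser.close()
# Save the extracted data to a JSON file for further processing
with open("trulia_sold_homes.json", "w") as f:
    json.dump(listings, f, indent=2)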
With the address data in hand, I then used the Google Maps Geocoding API to convert each address to a latitude/longitude pair. These coordinates could then be passed to the Uber price estimation API to retrieve the expected fare range for an UberX ride to SEA.
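Here is a minimal sketch of the geocoding step using the requests library; the API key is a placeholder, and error handling, caching, and rate limiting are left out.
# Minimal sketch of the geocoding step (GOOGLE_API_KEY is a placeholder).
import requests
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
def geocode(address, api_key):
    """Return (lat, lng) for an address via the Google Maps Geocoding API."""
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": api_key})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]
# Example usage (hypothetical address from the scraped Trulia data):
# lat, lng = geocode("1234 Main St, Seattle, WA", GOOGLE_API_KEY)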
I first used the official Uber API authentication flow to obtain an access token. Then, for each address's latitude/longitude pair, I made an HTTP GET request to the API's /estimates/price endpoint, specifying the end location as the airport's coordinates. The API response included a low_estimate and high_estimate for each requested trip, denominated in the local currency.
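Putting that together, a single request might look like the sketch below, assuming Uber's v1.2 REST endpoint and a bearer token from the OAuth flow; the airport coordinates are approximate.
# Sketch of one price-estimate request, assuming Uber's v1.2 REST endpoint and a
# bearer token obtained from the OAuth flow. SEA's coordinates are approximate.
import requests
UBER_PRICE_URL = "https://api.uber.com/v1.2/estimates/price"
SEA_LAT, SEA_LNG = 47.4489, -122.3094  # Seattle-Tacoma International Airport (approx.)
def estimate_fare_to_sea(lat, lng, access_token):
    """Return the UberX low/high fare estimates for a trip from (lat, lng) to SEA."""
    resp = requests.get(UBER_PRICE_URL,
                        headers={"Authorization": f"Bearer {access_token}"},
                        params={"start_latitude": lat, "start_longitude": lng,
                                "end_latitude": SEA_LAT, "end_longitude": SEA_LNG})
    resp.raise_for_status()
    for product in resp.json().get("prices", []):
        if product.get("localized_display_name") == "UberX":
            return product["low_estimate"], product["high_estimate"]
    return None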
Finally, I combined the Trulia home value data and the Uber price estimate data into a single dataset, with each row representing a unique location and including both the property sale price and the estimated range of Uber fares to the airport.
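The merge step itself is straightforward; here is a small sketch assuming both intermediate results were saved as CSVs sharing an Address column (file and column names are illustrative).
# Sketch of the merge step, assuming both intermediate results were exported to CSVs
# that share an "Address" column (file and column names are illustrative).
import pandas as pd
homes = pd.read_csv("trulia_sold_homes.csv")       # Address, Property Value
fares = pd.read_csv("uber_airport_estimates.csv")  # Address, Uber Low Estimate, Uber High Estimate
merged = homes.merge(fares, on="Address", how="inner")
merged.to_csv("uber_prices_property_values.csv", index=False)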
Here is a sample of the merged dataset:
| Address | Property Value | Uber Low Estimate | Uber High Estimate |
|---|---|---|---|
| 1234 Main St, Seattle, WA | $850,000 | $35 | $45 |
| 5678 Elm St, Seattle, WA | $1,200,000 | $40 | $50 |
| 901 Oak Ave, Seattle, WA | $650,000 | $30 | $38 |
Analyzing the Data: Exploring Relationships Between Uber Prices and Property Values
With the merged dataset prepared, I began my analysis in Python using data science libraries like pandas, NumPy, and matplotlib. The goal was to examine the relationship (if any) between Uber's price estimates and neighborhood property values.
I first calculated summary statistics for the Uber price estimates and property values:
import pandas as pd
df = pd.read_csv('uber_prices_property_values.csv')
print(df['Uber Low Estimate'].describe())
print(df['Uber High Estimate'].describe())
print(df['Property Value'].describe())
# Output
count 6000.000000
mean 34.408500
std 2.732195
min 30.000000
25% 32.000000
50% 34.000000
75% 36.000000
max 40.000000
Name: Uber Low Estimate, dtype: float64
count 6000.000000
mean 41.737000
std 2.949169
min 35.000000
25% 40.000000
50% 42.000000
75% 44.000000
max 52.000000
Name: Uber High Estimate, dtype: float64
count 6.000000e+03
mean 1.057711e+06
std 6.920378e+05
min 1.750000e+05
25% 5.800000e+05
50% 8.500000e+05
75% 1.300000e+06
max 4.500000e+06
Name: Property Value, dtype: float64
The summary stats revealed a few insights. First, the typical range of Uber price estimates to the airport was not enormous, with low estimates averaging $34 and high estimates averaging $42. Second, property values had a much wider distribution, ranging from a minimum of $175,000 to a max of $4.5 million.
Of course, averages can conceal underlying variation. To get a better sense of how Uber prices and property values interacted at a more granular level, I created a scatter plot with linear trendlines:
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
# Scatter the low and high estimates against property values
plt.scatter(df['Property Value'], df['Uber Low Estimate'], alpha=0.4, color='#FF5A5F', label='Uber Low Estimate')
plt.scatter(df['Property Value'], df['Uber High Estimate'], alpha=0.4, color='#00A799', label='Uber High Estimate')
# Fit first-degree (linear) trendlines and overlay them as dashed lines
z1 = np.polyfit(df['Property Value'], df['Uber Low Estimate'], 1)
z2 = np.polyfit(df['Property Value'], df['Uber High Estimate'], 1)
p1 = np.poly1d(z1)
p2 = np.poly1d(z2)
plt.plot(df['Property Value'], p1(df['Property Value']), linestyle='--', color='#FF5A5F', label='Uber Low Estimate Trendline')
plt.plot(df['Property Value'], p2(df['Property Value']), linestyle='--', color='#00A799', label='Uber High Estimate Trendline')
plt.xlabel('Property Value ($)')
plt.ylabel('Estimated Uber Fare to Airport ($)')
plt.title('Uber Price Estimates vs. Property Values in Seattle Neighborhoods')
plt.legend()
plt.show()
The scatter plot makes it clear that while there is a slight positive correlation between property values and Uber price estimates, it is quite weak. The linear trendlines have a barely discernible slope, and the data points are widely dispersed, indicating that property values are not a strong predictor of Uber prices.
To quantify the correlation, I calculated the Pearson correlation coefficient between property values and the midpoint of the Uber estimate range:
df['Uber Estimate Midpoint'] = (df['Uber Low Estimate'] + df['Uber High Estimate']) / 2
print(df[['Property Value', 'Uber Estimate Midpoint']].corr())
# Output
Property Value Uber Estimate Midpoint
Property Value 1.000000 0.176357
Uber Estimate Midpoint 0.176357 1.000000
The correlation coefficient of 0.176 confirms the visual evidence from the scatter plot: there is a very weak positive relationship between property values and Uber prices to the airport. A correlation of 0.176 corresponds to an R² of roughly 0.03, meaning property values explain only about 3% of the variance in estimated fares; knowing a neighborhood's property values provides little predictive power in determining the Uber fares residents can expect to pay.
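To put a number on that predictive power, one could also fit a simple linear regression on the same dataframe; the following sketch uses scipy.stats.linregress and is an optional follow-up rather than part of the original analysis.
# Optional check: a simple linear regression of fare midpoint on property value.
# The slope expresses how much the estimated fare changes per dollar of home value.
from scipy.stats import linregress
result = linregress(df['Property Value'], df['Uber Estimate Midpoint'])
print(f"Fare change per $100k of property value: {result.slope * 100_000:.2f}")
print(f"R-squared: {result.rvalue ** 2:.3f}")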
Visualizing Variation: Mapping Uber Prices Across Seattle Neighborhoods
To further explore how Uber prices vary spatially across Seattle, I used Folium, a Python library for creating interactive maps, to visualize the price estimates. I first grouped the data by zip code and calculated the mean low and high price estimates for each zip:
import folium
# Average the estimates and coordinates for each zip code
zip_estimates = df.groupby('Zip Code').agg({'Uber Low Estimate': 'mean',
                                            'Uber High Estimate': 'mean',
                                            'Latitude': 'mean',
                                            'Longitude': 'mean'}).reset_index()
seattle_map = folium.Map(location=[47.6062, -122.3321], zoom_start=11)
for _, row in zip_estimates.iterrows():
    folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                        radius=8,
                        popup=(f"Zip: {row['Zip Code']}<br>"
                               f"Avg Low Estimate: ${round(row['Uber Low Estimate'], 2)}<br>"
                               f"Avg High Estimate: ${round(row['Uber High Estimate'], 2)}"),
                        fill=True,
                        fill_color='blue',
                        color='gray',
                        fill_opacity=0.7).add_to(seattle_map)
seattle_map
Note: the output is an interactive HTML map rendered in the notebook, not a static image.
The map provides a useful geographical overview of how Uber's price estimates differ across Seattle neighborhoods. Each circle marker represents a zip code, with the marker's pop-up displaying the average low and high price estimates for airport trips starting in that zip.
While there is some variation from area to area, the differences are not stark. The zip code averages range from around $32 on the low end to $45 on the high end – a spread, to be sure, but not a massive gulf. Moreover, the spatial pattern seems more closely related to distance from the airport than any obvious socioeconomic clustering. Zip codes in the city center and northward tend to have slightly higher estimates than neighborhoods to the south, likely due to longer trip times.
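To sanity-check the distance explanation, one could compute each pickup point's straight-line (haversine) distance to SEA and correlate it with the fare midpoint. The sketch below assumes the Latitude and Longitude columns are still in the dataframe and uses approximate airport coordinates.
# Rough check of the distance explanation: correlate straight-line (haversine)
# distance to SEA with the fare midpoint. Assumes Latitude/Longitude columns exist;
# SEA's coordinates are approximate.
import numpy as np
SEA_LAT, SEA_LNG = 47.4489, -122.3094
def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two points."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))
df['Distance to SEA (km)'] = haversine_km(df['Latitude'], df['Longitude'], SEA_LAT, SEA_LNG)
print(df[['Distance to SEA (km)', 'Uber Estimate Midpoint']].corr())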
Examining Confounding Variables: Limitations and Caveats
While the preceding analysis suggests Uber's pricing algorithms do not overtly discriminate based on neighborhood affluence, it's important to acknowledge potential limitations. Property values are an imperfect proxy for a neighborhood's overall socioeconomic status. Pockets of wealth and poverty can coexist within the same zip code, and renters are not captured by home sale data.
There may also be confounding variables that mediate the relationship between property values and ride-share prices. For example, wealthier neighborhoods might have higher population density or more nightlife activity, leading to localized spikes in demand that trigger Uber's surge pricing multipliers. Controlling for such factors would require additional data and more sophisticated statistical modeling.
It's also worth noting that this analysis examines prices, but not actual utilization or revenue. If riders in affluent areas take more trips or spend more per ride, Uber might still generate disproportionate profit from those neighborhoods even without price discrimination. Fully exploring that question would require proprietary trip data from Uber.
Alternative Methodologies: Different Datasets, Techniques, and Platforms
This investigation relied on web scraping and Uber's public price estimation API, but there are other ways data scientists could approach the question of ride-share pricing fairness. Inside the company, analysts with access to the full universe of trip receipts and surge multiplier logs could conduct a more direct audit, comparing fares and surge frequencies across neighborhoods.
Third-party researchers could complement web-scraped data with other public datasets like census demographics, street density metrics, and public transit access to build richer neighborhood profiles. Audit studies, in which testers request rides from different locations, could also directly measure pricing disparities. And similar analyses could be extended to other ride-share platforms, like Lyft or Via, to compare practices across the industry.
Implications for Policy and Algorithmic Accountability
While it's reassuring that this analysis did not find evidence of systematic price discrimination by neighborhood wealth, the broader question of how ride-share platforms set prices deserves ongoing scrutiny. Algorithmic pricing has become widespread across industries, from airlines to online retail, and it often operates with little transparency to consumers.
As data scientists, we have a role to play in probing the fairness and accountability of these systems. By turning the tools of web scraping and statistical inference on the platforms we use every day, we can help surface disparate impacts and push for more equitable outcomes. At the same time, we must remain humble about the limitations of our analyses, acknowledging the potential for confounding variables and alternative explanations.
Policymakers also have a part in ensuring that algorithmic pricing does not cross the line into unlawful discrimination. While personalized pricing can improve efficiency by allocating scarce resources to those willing to pay the most, it can also raise distributional concerns if it systematically disadvantages certain groups. Regulators could require ride-share companies to disclose more granular data on pricing patterns, or even set guardrails around the types of user data that can be leveraged in setting prices.
Conclusion
As powerful as web scraping and data science techniques can be for auditing the fairness of digital platforms, they are not panaceas. The case of Uber's pricing illustrates both the potential and limitations of external algorithmic accountability.
By gathering data on ride price estimates and neighborhood characteristics, I found that property values have little predictive power in determining Uber fares across Seattle. Affluent areas might pay a bit more to the airport on average, but the relationship is quite weak, and prices appear to be much more strongly correlated with distance and duration.
This suggests that, at least in Seattle, Uber is not overtly price discriminating based on riders' ability to pay as proxied by home values. The platform's key pricing parameters – base fare, time, distance, and demand-based surge multipliers – seem to be the dominant inputs.
However, this analysis is far from the final word on ride-share pricing fairness. The data and methodology have limitations, from the potential for confounding variables to the use of home prices as a unitary indicator of affluence. And even if Uber's algorithms are "wealth-blind" with respect to price setting, that does not preclude other forms of bias or disparate impact.
Ultimately, this investigation is as much a prompt for further inquiry as it is a conclusive finding. It highlights the role that web scraping and public data analysis can play in holding algorithmic platforms accountable. But it also underscores the need for robust and holistic auditing, drawing on internal data and qualitative methods to fully characterize the equity implications of data-driven decision-making.
As ride-sharing and other algorithmically-mediated services become increasingly central to our economic lives, it will be up to researchers, advocacy groups, journalists, and policymakers to keep probing the black boxes that shape our opportunities and costs. By combining web scraping, data science, and a commitment to the public interest, we can work to ensure that the algorithms meant to optimize our markets do not end up perpetuating historic inequities.