
Linear and Logistic Paths To Predictive Insights

By Ashish John Edward

Updated: Oct 19, 2024

Machine learning is all around us, whether you're using a virtual assistant, scrolling through social media, or getting recommendations for your next movie. For a newcomer, the term might seem intimidating, but at its core, machine learning is just about teaching computers to learn from data and make decisions without being explicitly programmed.


In this article, let’s look at supervised learning algorithms, specifically regression.


Before diving into regression, it's helpful to quickly understand the two main categories of machine learning:

Supervised Learning


In supervised learning, the machine is given labelled data, which means that the data contains both the input and the correct output. The goal of supervised learning is to learn a function that maps inputs to the correct outputs.


Imagine you’re teaching a child to recognize animals. You show them pictures of animals and tell them what each animal is (for example, "this is a dog"). Over time, the child learns to recognize animals even without your help. This is what happens in supervised learning: the machine is trained using labelled examples so that it can predict the correct label for new, unseen data.


There are two main types of supervised learning algorithms:


  • Regression: When the output variable is a continuous number.


  • Classification: When the output variable is a category.

Unsupervised Learning


Unsupervised learning deals with data that is unlabelled, meaning the algorithm is left to find patterns or structure on its own. Unlike supervised learning, there's no clear answer or outcome to predict.


Think of this as observing a group of animals in the wild without knowing their names or species. The goal here is to find similarities and differences, grouping similar animals together, even though you don’t have predefined categories. In unsupervised learning, the machine tries to group data points based on patterns.


The most common type of unsupervised learning is clustering, where the algorithm groups similar data points together.


Supervised Learning Algorithms


Now, let’s dive deeper into the world of supervised learning algorithms, specifically regression – both linear and logistic.

 

Regression analysis, particularly Linear Regression and Logistic Regression, is one of the most fundamental concepts in data science and machine learning. These two techniques are used to predict outcomes based on relationships in data, but they have very different purposes. Whether you're predicting sales growth for a business or trying to determine the probability that a customer will buy a product, mastering these algorithms can be a game-changer.


Linear Regression: The Backbone of Predictive Modelling


Let’s think about a scenario: You’re the manager of a growing startup, and you want to forecast how much revenue you’ll generate next month based on the current trends in customer acquisition. You've collected some data over the past few months, and you’ve noticed that as customer acquisitions go up, so does your revenue. Now you want to quantify this relationship to make future predictions.


Linear Regression can help you draw a line through your data to find the relationship between customer growth and revenue. Once this line is drawn, you can input any number of customers and get an estimate of the revenue you'll make. It’s a simple yet powerful way to predict continuous outcomes.


The formula that represents this relationship is: Y = b_0 + b_1X


Where:

Y is the outcome you want to predict (in our case, revenue).

X is the independent variable (in our case, customer acquisitions).

b_0 is the intercept, or the value of Y when X is zero (think of it as the starting point).

b_1 is the slope, which represents how much Y changes for every one-unit change in X.


In our startup example, the slope tells you how much additional revenue you can expect for each new customer.
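To make this concrete, here is a minimal sketch in plain Python that fits the line Y = b_0 + b_1X by ordinary least squares. The customer and revenue figures below are invented purely for illustration:

```python
# Hypothetical monthly data for the startup example (illustrative numbers):
customers = [10, 20, 30, 40, 50]   # X: new customers acquired
revenue   = [25, 45, 63, 85, 105]  # Y: revenue in thousands

n = len(customers)
mean_x = sum(customers) / n
mean_y = sum(revenue) / n

# Ordinary least squares: b1 = Sxy / Sxx, b0 = mean_y - b1 * mean_x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(customers, revenue))
s_xx = sum((x - mean_x) ** 2 for x in customers)

b1 = s_xy / s_xx               # slope: extra revenue per new customer
b0 = mean_y - b1 * mean_x      # intercept: baseline revenue at zero new customers

# Predict revenue if we acquire 60 customers next month.
predicted = b0 + b1 * 60
print(f"b0={b0:.2f}, b1={b1:.2f}, predicted={predicted:.1f}")
# -> b0=4.60, b1=2.00, predicted=124.6
```

With these made-up numbers the fit gives a slope of 2.0 and an intercept of 4.6, so each new customer adds about 2 (thousand) in revenue, and the model forecasts roughly 124.6 for a month with 60 new customers.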


Airbnb – Using Linear Regression to Optimize Pricing


Problem

Airbnb relies heavily on dynamic pricing to ensure that hosts maximize their income while also offering competitive rates to guests. However, setting the right price involves many factors—location, seasonality, demand, and competition. Airbnb needed a model to predict the optimal price for each listing based on these variables.


Solution

Airbnb’s data scientists used Linear Regression to analyse historical data on booking prices, demand fluctuations, and other variables like proximity to tourist destinations or amenities. The Linear Regression model allowed Airbnb to predict the optimal price for listings based on current market conditions and demand. This model became the backbone of Airbnb's "Smart Pricing" feature, which automatically adjusts prices for hosts to match the predicted optimal rate.


Financial Impact

The introduction of Smart Pricing led to a significant increase in revenue for Airbnb hosts by helping them price their listings more competitively. Hosts using Smart Pricing saw a 5-20% increase in bookings on average, which, in turn, increased Airbnb’s commission revenue. The feature helped Airbnb grow into a multibillion-dollar company, with revenue exceeding $2.1 billion in Q2 2023.


Let’s look at another interesting case where linear regression has played a key role.


Zillow – Predicting Housing Prices


Problem

Zillow, the real estate giant, needed a way to provide potential home buyers and sellers with accurate and reliable estimates of home prices across the U.S. The challenge was that housing prices are influenced by a variety of factors—such as location, square footage, number of bedrooms, and market conditions. Zillow wanted to develop a model that could predict the price of any given home based on these features.


Solution

Zillow’s team of data scientists employed Linear Regression to build their home valuation tool called Zestimate. The model analyses historical sales data, along with real-time information on housing features (e.g., square footage, age of the home, lot size) and market conditions (e.g., interest rates, employment rates). By fitting a Linear Regression model, Zillow was able to predict housing prices for millions of homes across different neighbourhoods, taking into account multiple features that impact price.


Financial Impact

Zestimate quickly became Zillow’s most famous feature and drove massive traffic to its website, which became one of the top go-to platforms for real estate searches in the U.S. Zestimate helped Zillow capture a significant portion of the real estate market, with millions of homebuyers and sellers relying on its pricing estimates. As of 2023, Zillow’s annual revenue exceeded $2.1 billion, largely due to the success of its data-driven pricing tool.


To summarize: reach for linear regression when the outcome you want to predict is a continuous number (revenue, price, demand) and you expect a roughly straight-line relationship between that outcome and your input variables.

 


Logistic Regression: Making Predictions in a Binary World


While Linear Regression is used for predicting numbers, Logistic Regression is used for predicting categories. The most common use case is when you want to predict a binary outcome, like “yes or no,” “buy or not buy,” “spam or not spam.”


Let’s imagine you’re running an e-commerce store, and you want to predict whether a customer will purchase a product based on their browsing behaviour. You have historical data showing the number of items a customer views, time spent on the site, and whether they eventually made a purchase. This is where Logistic Regression shines. It gives you the probability that a customer will buy based on their behaviour.


One of the key benefits of Logistic Regression is that it doesn’t just predict a binary outcome—it gives you a probability. For example, based on the data from your e-commerce site, it might predict that a customer has a 75% chance of making a purchase. You can then use this information to target customers more effectively, offering discounts to those with a lower probability of buying or sending reminders to those who are likely to convert.


Logistic Regression works by applying a logistic function (which looks like an "S" curve) to convert a continuous prediction into a probability between 0 and 1.


The formula for Logistic Regression looks a little more complicated than Linear Regression, but the key takeaway is that it outputs a probability.

P(Y=1) = 1 / (1 + e^(-(b_0 + b_1X)))


Where:

P(Y=1) is the probability of the outcome (like making a purchase).

X is the independent variable (like the number of items viewed).

b_0 and b_1 are the coefficients learned during training.

e is Euler’s number (a mathematical constant).


This formula transforms the data into a probability that’s easy to interpret, allowing businesses to act based on likelihood.
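As a quick sketch, the formula can be computed directly in Python. The coefficients below (b_0 = -3.0, b_1 = 0.5) are made up for illustration, with x standing for the number of items a customer viewed:

```python
import math

def purchase_probability(x, b0, b1):
    """Logistic function: P(Y=1) = 1 / (1 + e^(-(b0 + b1*x)))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Made-up coefficients for illustration: b0 = -3.0, b1 = 0.5.
for items_viewed in (2, 6, 10):
    p = purchase_probability(items_viewed, b0=-3.0, b1=0.5)
    print(f"{items_viewed} items viewed -> P(purchase) = {p:.2f}")
# -> 2 items viewed -> P(purchase) = 0.12
# -> 6 items viewed -> P(purchase) = 0.50
# -> 10 items viewed -> P(purchase) = 0.88
```

Notice how the S-curve squeezes any input into the 0–1 range: when b_0 + b_1x equals zero the probability is exactly 0.5, and it saturates toward 0 or 1 as the input moves away from that point.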


Netflix – Using Logistic Regression to Predict Customer Retention


Problem

For subscription-based services like Netflix, customer retention is critical. Netflix wanted to predict which customers were likely to cancel their subscriptions (churn) based on their behaviour, such as viewing habits, login frequency, and customer interactions. The company needed a way to identify at-risk customers so they could intervene with personalized offers or targeted campaigns to retain them.


The Solution

Netflix implemented Logistic Regression to analyse customer behaviour data. By examining various factors such as the number of hours spent watching, the types of shows viewed, and the gaps between viewing sessions, Netflix built a model that gave a churn probability for each customer. This allowed them to identify customers with a high likelihood of cancelling their subscription.


Financial Impact

By identifying customers at risk of churning, Netflix was able to implement retention strategies such as personalized recommendations, special offers, and direct customer engagement. These efforts significantly reduced churn rates. As a result, Netflix was able to retain more subscribers, ultimately saving millions in revenue and improving customer lifetime value. Today, this predictive modelling has helped Netflix maintain a low churn rate of around 2-3%, a key driver in their $7.97 billion quarterly revenue as of 2023.


Let’s look at another interesting case where Logistic regression has played a key role.


Uber – Predicting Driver Retention


The Problem

Uber relies heavily on its driver network to meet customer demand for rides. However, the company faced a critical issue with driver churn, where drivers would leave the platform after a short period, leading to shortages in supply, poor customer experiences, and higher recruitment costs. Uber wanted to predict which drivers were likely to stop driving for the platform in the near future.


The Solution

Uber's data scientists used Logistic Regression to create a model that predicted driver churn. They collected data on driver behaviour—such as ride acceptance rates, hours driven per week, and customer ratings—and used it to determine the probability of each driver quitting the platform. The Logistic Regression model provided Uber with a probability score for each driver, helping them identify high-risk drivers early on.

 

Uber then implemented targeted retention strategies, such as offering bonuses, incentives, and personalized communication to drivers who were likely to leave. By understanding the factors contributing to churn, Uber was able to better manage driver supply.


Financial Impact

Reducing driver churn meant that Uber saved significantly on the costs of recruiting and training new drivers. The model helped the company retain thousands of drivers, reducing the overall churn rate. According to some estimates, Uber saved tens of millions of dollars annually in recruitment costs and improved service levels by retaining more experienced drivers.

 


To manually calculate a logistic regression prediction in Excel, you can simulate the logistic curve with a formula of the form:

=1/(1+EXP(-(Intercept + Slope*Predictor)))

where Intercept, Slope, and Predictor stand for the cells (or named ranges) holding those values.


Based on the result, classify the prediction (e.g., if probability > 0.5, then assign 1, otherwise 0).


For easier application, it is common to use Python libraries such as scikit-learn, which offer built-in support for these models.
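Libraries hide the fitting step, but it is short enough to sketch in plain Python. The toy data below (items viewed versus whether the customer purchased) is invented for illustration; the coefficients are learned by simple gradient descent on the logistic loss, and predictions apply the 0.5 threshold described above:

```python
import math

# Toy data (invented): items a customer viewed, and whether they purchased.
views     = [1, 2, 3, 4, 5, 6, 7, 8]
purchased = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Learn b0 and b1 by gradient descent on the logistic log-loss.
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    grad0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(views, purchased))
    grad1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(views, purchased))
    b0 -= lr * grad0 / len(views)
    b1 -= lr * grad1 / len(views)

def predict(x, threshold=0.5):
    """Return (probability, class): class is 1 if probability > threshold."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p > threshold)

print(predict(2))   # low probability  -> class 0 (unlikely to purchase)
print(predict(7))   # high probability -> class 1 (likely to purchase)
```

In practice you would call scikit-learn's ready-made model instead of writing the loop yourself, but the sketch shows what is happening underneath: the coefficients are nudged until the S-curve separates the purchasers from the non-purchasers.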


Linear vs. Logistic Regression: Key Differences


It’s crucial to understand when to use each algorithm, so let’s break down the main differences between Linear and Logistic Regression:

  • Output: Linear Regression predicts a continuous number (like revenue or price); Logistic Regression predicts a probability between 0 and 1, which is then mapped to a category.

  • Shape: Linear Regression fits a straight line through the data; Logistic Regression fits an S-shaped logistic curve.

  • Typical use cases: forecasting sales or housing prices versus predicting yes/no outcomes such as churn or purchases.

These stories show how companies leverage Linear Regression and Logistic Regression to make strategic decisions with far-reaching financial impacts, whether it’s predicting home prices or retaining critical workforce segments. These regression techniques helped companies like Zillow and Uber make smarter, data-driven decisions that contributed to their industry leadership.


Remember, machine learning is all about learning from data, and with time, you too will become more comfortable with these algorithms and the power they bring to solving real-world problems.


Next up, we will cover other supervised machine learning algorithms like KNN, SVM, Naive Bayes, Decision Trees, Random Forest, Boosted Trees & Neural Networks.
