Using Python SciPy's Distance Functions
When working with machine learning, data analysis, or computational geometry, calculating the distance between points in a multi-dimensional space is a crucial task. Distances are used in clustering algorithms, anomaly detection, and many other applications. Python’s SciPy library provides several distance functions that allow you to calculate different types of distances efficiently. In this blog post, we'll explore three popular distance metrics—Euclidean, Manhattan, and Hamming—using SciPy’s distance functions.
1. Euclidean Distance: .euclidean()
The Euclidean distance is the straight-line distance between two points in Euclidean space. It is perhaps the most commonly used distance metric, especially in algorithms such as K-Means clustering and K-Nearest Neighbors (KNN) classification. It follows from the Pythagorean theorem, extended to any number of dimensions.
Mathematically, the Euclidean distance between two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ in an n-dimensional space is given by:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
from scipy.spatial.distance import euclidean
point_1 = [1, 2, 3]
point_2 = [4, 5, 6]
dist = euclidean(point_1, point_2)
print("Euclidean Distance:", dist)
Both points must have the same number of dimensions for the distance to be defined.
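As a quick sanity check, the Euclidean distance (the square root of the sum of squared coordinate differences) can be computed by hand with NumPy and compared against SciPy's result. This is only a minimal sketch; the names manual_dist and scipy_dist are illustrative and not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import euclidean

point_1 = np.array([1, 2, 3])
point_2 = np.array([4, 5, 6])

# Square root of the sum of squared coordinate differences
manual_dist = np.sqrt(np.sum((point_1 - point_2) ** 2))
scipy_dist = euclidean(point_1, point_2)

print("Manual:", manual_dist)   # sqrt(27), roughly 5.196
print("SciPy: ", scipy_dist)
print("Match: ", np.isclose(manual_dist, scipy_dist))
```

Each coordinate differs by 3, so the sum of squares is 27 and both values should agree at about 5.196.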
2. Manhattan Distance: .cityblock()
The Manhattan distance (also known as the taxicab distance) measures the distance between two points in a grid-based system. Unlike the Euclidean distance, it only allows horizontal and vertical movements. You can think of it as the distance a taxi would travel in a grid-like street layout (such as the streets of Manhattan, hence the name).
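The Manhattan distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ in the plane is calculated as:

$$d = |x_1 - x_2| + |y_1 - y_2|$$

In general, for two points $p$ and $q$ in n-dimensional space, it is:

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$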
from scipy.spatial.distance import cityblock
point_1 = [1, 2, 3]
point_2 = [4, 5, 6]
dist = cityblock(point_1, point_2)
print("Manhattan Distance:", dist)
This will return the sum of the absolute differences of the coordinates.
As with the Euclidean distance, both points must have the same number of dimensions.
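The same kind of check works for the Manhattan distance: summing the absolute coordinate differences by hand should match what cityblock() returns. Again, this is a small sketch with illustrative variable names.

```python
import numpy as np
from scipy.spatial.distance import cityblock

point_1 = np.array([1, 2, 3])
point_2 = np.array([4, 5, 6])

# Sum of absolute coordinate differences: |1-4| + |2-5| + |3-6|
manual_dist = np.sum(np.abs(point_1 - point_2))
scipy_dist = cityblock(point_1, point_2)

print("Manual:", manual_dist)   # 3 + 3 + 3 = 9
print("SciPy: ", scipy_dist)
```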
3. Hamming Distance: .hamming()
The Hamming distance measures the number of positions at which the corresponding elements of two strings (or arrays) of equal length are different. This distance is particularly useful for comparing binary strings or sequences in problems related to error correction and information theory.
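Mathematically, the Hamming distance between two sequences $p$ and $q$ of length $n$ is defined as:

$$d_H(p, q) = \sum_{i=1}^{n} \mathbb{1}(p_i \neq q_i)$$

where the indicator function $\mathbb{1}(p_i \neq q_i)$ returns 1 if $p_i \neq q_i$ and 0 otherwise.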
In SciPy, you can compute the Hamming distance using the scipy.spatial.distance.hamming() function:
from scipy.spatial.distance import hamming
point_1 = [0, 1, 1, 0, 1]
point_2 = [1, 0, 1, 1, 0]
dist = hamming(point_1, point_2)
print("Hamming Distance:", dist)
Here, hamming() returns the proportion of positions at which the two vectors differ. Note that this is a normalized value: the fraction of differing positions rather than the raw count. In the example above, 4 of the 5 positions differ, so the result is 0.8.
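If the raw number of differing positions is what you need, one option is to multiply SciPy's normalized result by the vector length. The sketch below does that and cross-checks it against a plain Python count; raw_count and manual_count are illustrative names, not SciPy features.

```python
from scipy.spatial.distance import hamming

point_1 = [0, 1, 1, 0, 1]
point_2 = [1, 0, 1, 1, 0]

# SciPy returns the fraction of positions that differ
fraction = hamming(point_1, point_2)

# Scale by the vector length to recover the count of differing positions
raw_count = round(fraction * len(point_1))

# Cross-check by counting mismatches directly
manual_count = sum(a != b for a, b in zip(point_1, point_2))

print("Fraction:", fraction)       # 4 of 5 positions differ
print("Raw count:", raw_count)
print("Manual count:", manual_count)
```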
Why Use These Distance Functions?
Each distance metric serves a different purpose, and understanding when to use each one is essential for many applications in data science and machine learning.
- Euclidean Distance: It’s ideal for continuous, numerical data where the straight-line distance between points makes sense, such as in clustering algorithms or nearest neighbor searches.
- Manhattan Distance: This is often used when the data is constrained to a grid, or when only horizontal and vertical movements are allowed (as in a city street grid).
- Hamming Distance: It’s best used when comparing categorical data or binary strings, such as in pattern recognition or error correction codes.
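To see how much the choice of metric matters, the sketch below applies all three functions to the same pair of binary vectors. The vectors are just sample data chosen for illustration; the point is that each metric gives a different number for the same input.

```python
from scipy.spatial.distance import cityblock, euclidean, hamming

u = [0, 1, 1, 0, 1]
v = [1, 0, 1, 1, 0]

# Four positions differ, each by exactly 1
d_euclidean = euclidean(u, v)   # sqrt(1 + 1 + 0 + 1 + 1) = 2.0
d_manhattan = cityblock(u, v)   # 1 + 1 + 0 + 1 + 1 = 4
d_hamming = hamming(u, v)       # 4 differing positions out of 5 = 0.8

print("Euclidean:", d_euclidean)
print("Manhattan:", d_manhattan)
print("Hamming:  ", d_hamming)
```

For binary data like this, the Manhattan distance equals the raw count of differing positions, while hamming() reports the same information as a fraction.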
Conclusion
SciPy’s euclidean(), cityblock(), and hamming() functions cover three common ways of measuring the distance between data points. Each is useful in different scenarios, and selecting the right one depends on the nature of your data and the problem you're trying to solve.
By using these functions, you can perform the distance computations that underpin clustering, classification, and many other data analysis tasks. SciPy's efficient and easy-to-use distance functions are valuable for anyone working in data science or machine learning.