Add DBSCAN clustering algorithm in machine_learning/#14851
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request adds a new machine_learning/dbscan.py module implementing the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm from scratch, providing a density-based alternative to existing clustering implementations in the repository.
Changes:
- Adds a from-scratch DBSCAN implementation (
dbscan) with input validation and-1noise labeling. - Adds supporting helpers (
euclidean_distance,get_neighbors) used by the clustering routine. - Includes doctests and reference links (Wikipedia + original paper) and a
__main__doctest runner.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Point Types: | ||
| - Core point: Has at least `min_points` neighbors within `epsilon` distance | ||
| - Border point: Within `epsilon` of a core point, but has fewer than | ||
| `min_points` neighbors | ||
| - Noise point: Neither core nor border — labeled as -1 |
| def get_neighbors( | ||
| data: list[list[float]], point_index: int, epsilon: float | ||
| ) -> list[int]: | ||
| """ | ||
| Return indices of all points within epsilon distance of data[point_index]. | ||
|
|
||
| >>> data = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]] | ||
| >>> get_neighbors(data, 0, 0.5) | ||
| [0, 1] | ||
| >>> get_neighbors(data, 2, 0.5) | ||
| [2] | ||
| >>> get_neighbors(data, 0, 10.0) | ||
| [0, 1, 2] | ||
| """ | ||
| return [ | ||
| index | ||
| for index, point in enumerate(data) | ||
| if euclidean_distance(data[point_index], point) <= epsilon | ||
| ] |
zain-cs
left a comment
There was a problem hiding this comment.
Good implementation overall! The docstrings, type hints, and doctests are comprehensive and well-written.
One observation in dbscan(): after reassigning a noise point to the current cluster here:
if labels[current_point] == -1:
labels[current_point] = current_cluster_idthe already_in_another check immediately below will never trigger for that point since it was just assigned. The guard is only meaningful for points that were previously assigned to another cluster — consider reordering the logic to make the intent clearer.
Otherwise looks solid! 👍
|
Thank you for the review @zain-cs! You're right the already_in_another check was redundant after the noise reassignment. I've reordered the logic so the guard runs before the assignment, making the intent clearer. Please take another look when you get a chance! |
zain-cs
left a comment
There was a problem hiding this comment.
Thanks for addressing the feedback @PRERITARYA! The updated logic is
much cleaner — running the cluster membership guard before the
assignment makes the intent immediately clear.
I also noticed the ruff check is flagging W292 (no newline at end of
file) — a quick fix would be adding a trailing newline to get the
checks passing. 👍
Describe your change:
Add DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
clustering algorithm implemented from scratch without any external ML libraries.
Checklist:
Fixes #10683