
Unsupervised Learning for Stock Selection: K-Means clustering

1 Introduction

In the dynamic and complex world of financial markets, effective stock selection is pivotal for investors seeking to maximize returns while managing risk. Traditional methods of stock selection often rely on fundamental and technical analyses, which can be time-consuming and subject to human biases. With the advent of advanced data analytics and machine learning techniques, there is a growing interest in leveraging these tools to enhance investment decision-making processes. One such promising approach is non-hierarchical cluster analysis, a method rooted in unsupervised learning, which groups stocks based on inherent similarities in their characteristics without relying on predefined labels or categories.

Non-hierarchical cluster analysis, specifically k-means clustering, offers a robust framework for identifying patterns and relationships within large datasets, making it particularly suitable for the stock market's vast and volatile landscape. This method allows investors to categorize stocks into distinct groups based on various features such as price movements, trading volumes, and financial ratios. By doing so, it provides a structured way to uncover hidden structures in the data that might not be evident through conventional analysis.

This paper explores the application of non-hierarchical cluster analysis to stock selection, emphasizing its potential to enhance investment strategies. We will delve into the methodology of k-means clustering, discussing its advantages over hierarchical clustering methods, and demonstrate its practical application through a comprehensive case study. The objective is to showcase how this technique can aid in identifying groups of stocks with similar behavior, thereby enabling more informed and diversified investment choices.

In the subsequent sections, we will provide a detailed overview of the clustering algorithm, present our data collection and preprocessing steps, and analyze the results of applying non-hierarchical cluster analysis to a selected stock dataset. Through this exploration, we aim to highlight the effectiveness of this approach in uncovering valuable insights and facilitating more strategic stock selection decisions in the ever-evolving financial markets.

2 Distance Calculation

In k-means clustering, the choice of distance metric significantly impacts the formation and characteristics of clusters. The most commonly used distance metric is the Euclidean distance, which measures the straight-line distance between two points in multidimensional space and is computationally efficient. Another important metric is the Manhattan distance (or L1 norm), which calculates the distance between two points by summing the absolute differences of their coordinates, useful in scenarios with high-dimensional data where differences along each dimension are equally significant. The Mahalanobis distance is a more sophisticated metric that accounts for correlations between variables and scales data based on their variances and covariances, making it ideal for datasets with correlated features. Other distance metrics include the Chebyshev distance, which considers the maximum absolute difference across all dimensions, and the cosine similarity, which measures the cosine of the angle between two vectors (strictly a similarity rather than a distance) and is particularly effective for high-dimensional, sparse datasets. Each distance metric offers unique advantages and is chosen based on the specific characteristics and requirements of the clustering task at hand.
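As a rough illustration of how these metrics differ, the following Python sketch computes four of them for the same pair of points. The two vectors are made-up values chosen only so the arithmetic is easy to follow; they are not taken from the dataset used in this paper.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 norm)."""
    return float(np.sum(np.abs(a - b)))

def chebyshev(a, b):
    """Largest absolute difference across all dimensions."""
    return float(np.max(np.abs(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between the two vectors (a similarity, not a distance)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two made-up feature vectors, for illustration only.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
print(chebyshev(a, b))   # 4.0
print(cosine_similarity(a, b))
```

The coordinate differences are (3, 4, 0), so the same pair of points is 5, 7, or 4 units apart depending on the metric, which is why the choice can change which centroid a stock is assigned to.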

In this case we focus on the Mahalanobis distance. The Euclidean distance is a special case of it, obtained when the covariance matrix is the identity, that is, when the variables are uncorrelated and have unit variance.

2.1 Mahalanobis distance

The Mahalanobis distance is a multivariate measure that calculates the distance between a point and a distribution, taking into account the correlations among the variables. Unlike the Euclidean distance, which treats all variables as equally important and uncorrelated, the Mahalanobis distance considers the variance and covariance of the data, making it particularly useful for identifying multivariate outliers and dealing with correlated features.

To understand how the Mahalanobis distance works, consider a dataset with a mean vector µ representing the average values of each variable. The covariance matrix Σ of the dataset captures the variances of each variable and the covariances between pairs of variables. For a given point x, the Mahalanobis distance D_M from x to the mean vector µ is calculated as

$$D_M(x) = \sqrt{(x - \mu)^T \, \Sigma^{-1} \, (x - \mu)}$$

In this formula, (x − µ) represents the difference vector between the point and the mean, (x − µ)^T is its transpose, and Σ^{-1} is the inverse of the covariance matrix. The Mahalanobis distance thus accounts for the spread of the data and the correlations between variables, providing a measure that standardizes the scale of the data and transforms correlated variables into uncorrelated ones through the inverse covariance matrix. The resulting distance indicates how far a point is from the mean in a multivariate context. Points closer to the mean vector, in terms of this distance, are considered more typical, while those further away are considered outliers. This makes the Mahalanobis distance particularly effective for clustering, anomaly detection, and other multivariate analyses where data dimensions are correlated and variances differ.
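A minimal Python sketch of this calculation follows, assuming the mean and covariance are estimated from the sample itself. The simulated two-variable data and the names `along` and `against` are hypothetical and serve only to show the effect of correlation on the distance.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of point x from a distribution with mean mu and covariance cov."""
    diff = x - mu
    # Solve cov @ z = diff rather than forming the explicit inverse.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

# Simulated sample of two positively correlated variables (illustrative only).
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

along = np.array([1.0, 1.0])     # lies in the direction of the correlation
against = np.array([1.0, -1.0])  # lies against the direction of the correlation

print(mahalanobis(along, mu, cov))
print(mahalanobis(against, mu, cov))

# With an identity covariance matrix the result is the Euclidean distance.
print(mahalanobis(along, mu, np.eye(2)), np.linalg.norm(along - mu))
```

Both test points are about equally far from the mean in Euclidean terms, but the point that contradicts the correlation pattern receives a noticeably larger Mahalanobis distance.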

3 K-means Clustering

K-means clustering is a popular and straightforward unsupervised machine learning algorithm used for partitioning a dataset into a set of distinct, non-overlapping groups or clusters. The algorithm works by initializing a predefined number of cluster centroids, usually chosen randomly. It then iteratively refines these centroids by assigning each data point to the nearest centroid based on the Euclidean distance, recalculating the centroids as the mean of all points assigned to each cluster, and repeating this process until convergence. Convergence is typically achieved when the assignments of data points to clusters no longer change or change minimally. The primary goal of k-means clustering is to minimize the variance within each cluster while maximizing the variance between different clusters, thereby ensuring that each cluster is as distinct as possible. This technique is widely used for its simplicity and efficiency, making it an attractive choice for various applications, including market segmentation, image compression, and, notably, stock selection in financial analysis.

In mathematical notation, given a set of observations x_1, x_2, ..., x_n, our goal is to partition these n observations into k sets S = {S_1, S_2, ..., S_k}, with k ≤ n. Formally, the minimization of the sum of squares within the sets can be written as:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where µ_i is the centroid of the points in S_i,

$$\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x,$$

|S_i| is the size of S_i, and ||·|| is the usual L2 norm. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \frac{1}{|S_i|} \sum_{x, y \in S_i} \lVert x - y \rVert^2$$

Since the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters (between-cluster sum of squares). The iterative process of k-means clustering begins by randomly initializing a set number of cluster centroids. Each data point is then assigned to the nearest centroid based on a chosen distance metric (here, the squared Euclidean distance):

$$S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \ \forall j,\ 1 \le j \le k \right\}$$

where m_1^{(1)}, ..., m_k^{(1)} is the initial set of means. After this initial assignment, the centroids are recalculated as the mean of all points assigned to each cluster, thereby moving the centroids to the center of their respective clusters. This reassignment and recalculation process repeats iteratively: each data point is reassigned to the nearest updated centroid, and the centroids are recalculated again as follows:

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$

This iterative cycle continues until convergence is reached, which occurs when the centroids no longer change significantly between iterations, or a predefined number of iterations is completed. The k-means algorithm aims to minimize the total variance within clusters, ensuring that data points within each cluster are as close to each other as possible while maximizing the distance between different clusters. This simple yet powerful process allows k-means to efficiently partition data into meaningful groups, making it widely used for various applications, including market segmentation, document clustering, and stock selection.
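The assignment and update steps can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions (random initialization from the observations, squared Euclidean distance, a fixed iteration cap), not the procedure used later in the paper. The function name `kmeans` and the simulated two-group data are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: returns labels, centroids, and within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Simulated data: two well-separated groups of 50 points each (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

labels, centroids, wcss = kmeans(X, k=2)
print(centroids)
print(wcss)
```

Because the result depends on the random starting centroids, practical implementations usually run the algorithm several times and keep the solution with the lowest within-cluster sum of squares.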

3.1 Optimal number of clusters

Identifying the optimal number of clusters in k-means clustering is a crucial step to ensure meaningful and interpretable results. One common method to determine this is the Elbow Method, which involves running the k-means algorithm for a range of cluster numbers and plotting the within-cluster sum of squares (WCSS) against the number of clusters. The WCSS typically decreases as the number of clusters increases, as clusters become more refined. The optimal number is often indicated by an "elbow" point in the plot, where the rate of decrease sharply slows, suggesting diminishing returns in reducing WCSS with additional clusters. Another approach is the Silhouette Analysis, which measures how similar each point is to its own cluster compared to other clusters. The silhouette coefficient ranges from -1 to 1, with higher values indicating better-defined clusters. The optimal number of clusters maximizes the average silhouette coefficient.
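To make the silhouette coefficient concrete, the sketch below computes the average silhouette for a given labeling using Euclidean distances. The function name `mean_silhouette` and the simulated data are hypothetical. The example compares a labeling that matches the two simulated groups with one that splits each group arbitrarily in two.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient (Euclidean distances, at least two clusters)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster is given a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Simulated data: two well-separated groups of 50 points each (illustrative only).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

two_groups = np.repeat([0, 1], 50)         # matches the simulated structure
four_groups = np.repeat([0, 1, 2, 3], 25)  # splits each group arbitrarily in half

good = mean_silhouette(X, two_groups)
split = mean_silhouette(X, four_groups)
print(good, split)
```

In practice the labels would come from running k-means for each candidate number of clusters, and the candidate with the highest average silhouette would be preferred.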

Additionally, methods like the Gap Statistic and the Bayesian Information Criterion (BIC) provide more advanced and statistically robust techniques for determining the appropriate number of clusters, considering both the fit and complexity of the model. By employing these methods, practitioners can select a cluster number that balances simplicity and explanatory power, enhancing the reliability and interpretability of the k-means clustering results.

The Elbow Method is a heuristic used in k-means clustering to identify the optimal number of clusters by examining how the within-cluster variance changes as the number of clusters increases. This method involves plotting the Residual Sum of Squares (RSS) or the Within-Cluster Sum of Squares (WCSS) against the number of clusters. The RSS is a measure of the total variance within the clusters, calculated as the sum of the squared distances between each data point and its corresponding cluster centroid. As the number of clusters increases, the RSS typically decreases because each cluster is better able to fit its data points. The plot usually shows a sharp decrease in RSS for a few clusters, followed by a point where the rate of decrease slows down significantly. This point, where the curve bends or forms an "elbow," indicates the optimal number of clusters. Adding more clusters beyond this point yields minimal gains in explaining the variance, thus suggesting a natural stopping point.

The Elbow Method can also be applied using the Semi-Partial Residual Sum of Squares (SPRSS), which measures the reduction in RSS when an additional cluster is added. To employ this approach, one computes the SPRSS for each possible number of clusters k by subtracting the RSS of the model with k clusters from the RSS of the model with k−1 clusters. This difference quantifies how much additional explanatory power is gained by increasing the number of clusters. Plotting SPRSS against the number of clusters can highlight an "elbow" point similar to the standard RSS plot. The optimal number of clusters is indicated where adding another cluster provides diminishing returns in terms of the reduction in RSS. Both the RSS-based and SPRSS-based Elbow Methods help in identifying a balance between the complexity of the clustering model and its explanatory power, ensuring that the chosen number of clusters is both parsimonious and effective in capturing the structure of the data.
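The two curves can be computed as in the following Python sketch. It assumes a basic k-means with several random restarts, keeping the lowest WCSS for each k. The helper `kmeans_wcss` and the simulated three-group data are hypothetical and stand in for the real dataset.

```python
import numpy as np

def kmeans_wcss(X, k, n_init=10, n_iter=100, seed=0):
    """Lowest WCSS found over n_init random restarts of basic k-means."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _restart in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(n_iter):
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            converged = np.allclose(new, centroids)
            centroids = new
            if converged:
                break
        best = min(best, float(((X - centroids[labels]) ** 2).sum()))
    return best

# Simulated data with three well-separated groups (illustrative only).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

ks = list(range(1, 7))
wcss = [kmeans_wcss(X, k) for k in ks]
# SPRSS for k clusters: RSS with k-1 clusters minus RSS with k clusters.
sprss = [wcss[i - 1] - wcss[i] for i in range(1, len(wcss))]

for k, w in zip(ks, wcss):
    print(k, round(w, 1))
for k, s in zip(ks[1:], sprss):
    print(k, round(s, 1))
```

On data like this, both curves flatten after k = 3: the WCSS drops sharply up to three clusters and the SPRSS for a fourth cluster is small by comparison.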

4 Practical Application

Our sample consists of the constituents of the S&P 500, restricted to the stocks that have been listed since 2000. For our analysis we consider five variables: trading volume, market cap, revenue, P/E ratio, and beta. The first step of the procedure is to standardize the variables: because they are measured on very different scales, the variables with the largest magnitudes would otherwise dominate the distance calculations and distort the results.
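A minimal sketch of the standardization step in Python, assuming a z-score transformation (subtract the mean, divide by the sample standard deviation). The four rows below are made-up numbers in the order volume, market cap, revenue, P/E, beta; they are not values from our sample.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the column mean, divide by the sample standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std

# Made-up rows: volume, market cap, revenue, P/E, beta (illustrative only).
X = np.array([
    [2.0e6, 5.0e10, 1.2e10, 18.0, 0.9],
    [8.0e6, 9.0e11, 2.5e11, 30.0, 1.2],
    [5.0e5, 8.0e9,  3.0e9,  12.0, 0.7],
    [3.0e6, 1.5e11, 6.0e10, 22.0, 1.1],
])

Z = standardize(X)
print(Z.round(2))
```

After the transformation every column has mean 0 and standard deviation 1, so market cap (in the hundreds of billions) no longer outweighs beta (around 1) when distances are computed.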

4.1 Outliers detection

The preliminary treatment of outliers using the Mahalanobis distance starts by calculating this distance for each data point in the dataset. The Mahalanobis distance measures how far each point is from the multivariate mean, considering the correlations between variables, making it an effective tool for identifying outliers in a multivariate context. Each data point's Mahalanobis distance is squared and then divided by the number of variables in the dataset. For large samples, a common rule of thumb is to consider a data point as an outlier if this value exceeds 4. This criterion indicates that the point is significantly distant from the core cluster of data, suggesting it deviates substantially from the typical distribution of the dataset.
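The rule can be sketched in Python as follows, assuming the mean and covariance are estimated from the full sample, including any outliers. The function name `mahalanobis_outliers`, the simulated data, and the planted extreme row are hypothetical.

```python
import numpy as np

def mahalanobis_outliers(X, threshold=4.0):
    """Flag rows whose squared Mahalanobis distance divided by the number of variables exceeds threshold."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mu
    # Row-wise squared distances: diff_i^T cov^{-1} diff_i.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    ratio = d2 / X.shape[1]
    return ratio > threshold, ratio

# Simulated standardized data for five variables, with one planted extreme row.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X[0] = [8.0, -8.0, 8.0, -8.0, 8.0]

flags, ratio = mahalanobis_outliers(X)
print(int(flags.sum()), round(float(ratio[0]), 1))
```

The planted row is flagged while almost all ordinary rows fall well below the cutoff of 4.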

Outliers identified in this manner can unduly influence the results of statistical analyses, leading to skewed conclusions and misinterpretations. Therefore, these points are carefully examined and typically removed from the dataset. This removal process ensures that the subsequent analyses are more accurate and reliable, as they are based on a dataset that better represents the underlying population without the distortion caused by extreme values. By eliminating these outliers, we can achieve more robust statistical inferences and improve the overall quality of the data analysis. Thus, the preliminary treatment of outliers using the Mahalanobis distance is a crucial step in the data cleaning process, enhancing the integrity and validity of the analytical outcomes.

4.2 Optimal number of Groups

To determine the optimal number of groups for stock selection, we first employed hierarchical clustering as a preliminary step. Hierarchical clustering provided a visual and intuitive way to explore the structure of the data through a dendrogram, a tree-like diagram that displays the arrangement of the clusters formed at each stage of the algorithm. By analyzing the dendrogram, we were able to identify potential clusters by examining where to "cut" the tree, which reveals distinct groupings of stocks. This method helped in estimating a suitable number of clusters by visually inspecting the height of the branches. Following this first criterion, the optimal number of groups is 3: since the branch on the left is somewhat longer, we cut the dendrogram just after it.

Concurrently, we applied the Elbow Method to the hierarchical clustering results. We plotted the within-cluster sum of squares (WCSS) against the number of clusters and looked for the point where the rate of decrease in WCSS significantly slowed down, forming an "elbow." This elbow point indicated a diminishing return on the variance explained by adding more clusters. By combining insights from both the dendrogram cut method and the Elbow Method, we derived a well-supported and robust estimate of the optimal number of clusters to be used in the subsequent non-hierarchical k-means clustering process. This dual-method approach ensured a more reliable and accurate determination of the appropriate number of clusters for our stock selection analysis. This second analysis places the optimal number between 2 and 4, supporting the previous result.
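How a hierarchical solution yields a WCSS curve can be sketched in Python. The example assumes Ward's linkage, which the text does not specify, and uses a deliberately naive implementation suited only to small samples. The function name `ward_wcss_by_k` and the simulated data are hypothetical.

```python
import numpy as np

def ward_wcss_by_k(X):
    """Naive Ward agglomeration; returns {k: WCSS of the k-cluster solution}."""
    clusters = [[i] for i in range(len(X))]
    wcss = {len(X): 0.0}  # every point in its own cluster
    total = 0.0
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                # Ward's criterion: increase in WCSS caused by merging a and b.
                cost = na * nb / (na + nb) * float(gap @ gap)
                if cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        total += cost
        wcss[len(clusters)] = total
    return wcss

# Simulated data with three well-separated groups (illustrative only).
rng = np.random.default_rng(5)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

wcss_by_k = ward_wcss_by_k(X)
for k in range(1, 7):
    print(k, round(wcss_by_k[k], 1))
```

Under this linkage, the cost of each merge is exactly the increase in WCSS, so the difference between the values for k−1 and k clusters is the semi-partial quantity, and the large jumps correspond to the long branches of the dendrogram.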

For completeness, we also report what we would obtain using the semi-partial residual sum of squares. In practice, we usually rely on the dendrogram and just one of the other two methods.

4.3 Clustering

In SAS, clustering using PROC CANDISC and PROC FASTCLUS is a two-step process that effectively identifies and visualizes distinct groups within the data. The process begins with PROC CANDISC (Canonical Discriminant Analysis), which transforms the original multivariate data into a smaller set of canonical variables. These canonical variables are linear combinations of the original variables, designed to maximize the separation between predefined groups. By reducing the dimensionality of the data, PROC CANDISC emphasizes the differences between the groups, making it easier to distinguish between them.

After generating the canonical variables, PROC FASTCLUS is applied to perform clustering. PROC FASTCLUS is a k-means clustering procedure that rapidly partitions the observations into a specified number of clusters. The process starts by selecting initial cluster centroids, then iteratively reassigning each observation to the nearest centroid. The centroids are recalculated as the mean of the observations in each cluster. This iterative process continues until the cluster assignments stabilize, indicating convergence.

The clusters formed by PROC FASTCLUS are then plotted on a graph with the canonical variables on the axes, providing a clear visual representation of the data distribution within the clusters. This visualization helps in understanding how the clusters are separated based on the canonical variables, which encapsulate the most discriminative information from the original dataset. The combined use of PROC CANDISC and PROC FASTCLUS thus results in well-defined, distinct clusters that are easy to interpret and analyze, enhancing the overall understanding of the data’s structure and relationships.
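For readers without SAS, the idea behind the canonical variables can be sketched in Python with NumPy. This is not a reimplementation of PROC CANDISC: it computes the directions that maximize between-group relative to within-group scatter, scaled to unit length, which may differ from the scaling SAS reports. The function name `canonical_variables`, the group labels, and the simulated data are hypothetical.

```python
import numpy as np

def canonical_variables(X, labels, n_components=2):
    """Project X onto the leading canonical discriminant directions for the given groups."""
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))  # pooled within-group scatter
    B = np.zeros((p, p))  # between-group scatter
    for g in np.unique(labels):
        Xg = X[labels == g]
        mg = Xg.mean(axis=0)
        W += (Xg - mg).T @ (Xg - mg)
        d = (mg - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)
    # Canonical directions: eigenvectors of W^{-1} B, by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return (X - grand_mean) @ eigvecs[:, order].real

# Simulated data: three groups in four dimensions (illustrative only).
rng = np.random.default_rng(11)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [5.0, 0.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0, 0.0]])
X = np.vstack([c + rng.normal(size=(30, 4)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

scores = canonical_variables(X, labels)
group_means = np.array([scores[labels == g].mean(axis=0) for g in range(3)])
print(group_means.round(2))
```

Plotting the two columns of `scores` against each other, colored by group, gives the kind of two-dimensional view of cluster separation described above.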

4.4 Profiling

Group profiling involves examining the characteristics of each cluster to understand what differentiates them. In this context, we performed a t-test in SAS for each cluster across several variables (revenue, market cap, P/E ratio, beta, and trading volume) to determine if these variables significantly differ between clusters. The t-test evaluates whether the means of the variables in different clusters are statistically distinct, providing insights into the unique profiles of each group.
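The statistic behind such a comparison can be sketched in Python. The sketch assumes that each cluster is compared with all remaining stocks using Welch's unequal-variance t-test, which is one common choice and not necessarily the variant run in SAS. The function name `welch_t` and the simulated values are hypothetical.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return float(t), float(df)

# Simulated standardized values of one variable (illustrative only):
# 40 stocks in the cluster being profiled, 120 stocks in the other clusters.
rng = np.random.default_rng(9)
in_cluster = rng.normal(2.0, 1.0, size=40)
others = rng.normal(0.0, 1.0, size=120)

t_stat, dof = welch_t(in_cluster, others)
print(round(t_stat, 2), round(dof, 1))
```

The p-value is then read from a t distribution with the computed degrees of freedom; in Python, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value for this variant.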

The procedure begins with the t-test analysis in SAS, which outputs several key pieces of information. For each variable, SAS provides the mean, standard deviation, and standard error for each cluster. It also calculates the t-statistic and corresponding p-value, which indicate whether the differences in means between clusters are statistically significant. A low p-value (typically less than 0.05) suggests that the differences are significant.

Additionally, SAS generates several graphical outputs for each variable. The distribution of the variable is displayed as a histogram overlaid with a fitted normal curve and a kernel density estimate, which provides a smoothed curve representing the distribution of the data. These graphs help in visualizing the central tendency and spread of the variable within each cluster. Below is an example of the results from the t-test performed on the first cluster.

Furthermore, SAS produces Q-Q (quantile-quantile) plots for each variable. A Q-Q plot compares the quantiles of the variable's distribution against the quantiles of a normal distribution. If the points on the Q-Q plot fall along a straight line, it indicates that the variable follows a normal distribution. Deviations from the line suggest departures from normality, such as skewness or heavy tails. Below is an example of the results from the t-test performed on the first cluster.
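The coordinates of a Q-Q plot can be computed with a few lines of Python. The sketch assumes plotting positions of (i − 0.5)/n, which may differ from the convention SAS uses. The function name `qq_points` and the simulated samples are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical normal quantiles, ordered standardized sample values)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions, an assumed convention
    theo = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theo, z

# Simulated samples (illustrative only): one roughly normal, one right-skewed.
rng = np.random.default_rng(13)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

theo_n, z_n = qq_points(normal_sample)
theo_s, z_s = qq_points(skewed_sample)

# Correlation between the two axes: close to 1 when the points lie on a straight line.
r_normal = float(np.corrcoef(theo_n, z_n)[0, 1])
r_skewed = float(np.corrcoef(theo_s, z_s)[0, 1])
print(round(r_normal, 3), round(r_skewed, 3))
```

Plotting `z` against `theo` gives the Q-Q plot itself; the skewed sample bends away from the diagonal, which shows up here as a lower correlation.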

Reading these graphs involves looking at the histograms and kernel density plots to assess the shape, center, and spread of the data distributions in each cluster. The Q-Q plots are used to verify the normality of the distributions. These visualizations, combined with the t-test results, provide a comprehensive profile of each cluster, highlighting the significant differences in the variables of interest and helping to understand the unique characteristics that define each group.

From our analysis, the null hypothesis of independence between cluster membership and the variables P/E and beta cannot be rejected for the 3 groups. In contrast, the clusters of S&P 500 stocks are significantly different in terms of revenue, market cap, and trading volume. This suggests that these variables play a crucial role in differentiating the clusters, providing valuable insights for portfolio selection and investment strategies.

The significant differences in revenue across clusters imply that the clusters contain companies with varying income levels. Some clusters may represent high-revenue companies, potentially indicating mature, stable businesses, while others may consist of lower-revenue firms, possibly representing growth-oriented or emerging companies. The variation in market cap suggests that clusters contain companies of different sizes. Large-cap stocks typically offer stability and lower volatility, whereas small-cap stocks might offer higher growth potential but with greater risk. The differences in trading volume indicate varying levels of liquidity across clusters. High trading volumes often suggest better liquidity, allowing for easier buying and selling of stocks without significantly affecting the stock price, while low trading volumes might indicate less liquidity and higher volatility.

Understanding these significant differences allows investors to diversify their portfolios more effectively. By including stocks from different clusters, investors can balance stability from high-revenue, large-cap, high-liquidity stocks with growth potential from lower-revenue, small-cap, lower-liquidity stocks. This information can also help in managing risk, as allocating a larger portion of the portfolio to clusters with high market cap and trading volume can reduce overall portfolio volatility, given that these stocks are typically more stable and liquid. Additionally, investors with specific strategies can tailor their portfolios accordingly. Growth investors might focus on clusters with lower revenue and market cap, anticipating higher future growth, while value investors might prefer clusters with higher revenue and larger market caps, seeking stable returns. Awareness of trading volume differences helps in planning for liquidity needs. Investors requiring high liquidity can prioritize clusters with higher trading volumes, ensuring they can enter and exit positions with minimal price impact.

In summary, the significant differences in revenue, market cap, and trading volume across the clusters of S&P 500 stocks provide crucial insights for portfolio selection. By leveraging these insights, investors can create diversified, risk-managed portfolios tailored to their specific investment strategies and liquidity requirements. This approach enhances the potential for achieving desired investment outcomes while mitigating associated risks.

