Identify correlated stocks from F&O stock list

Go back to index

Finding a needle in haystack

There are more than 1600 companies listed in NSE, which gives you more than 1.2 million potential pairs. This is not a very helpful number, even for the computer it might become computationally challenging to handle such a big number. So we need to apply series of fileters to narrow down our search. Here’s what I’m going to do-

1. Narrow down the universe - Consider stocks traded in F&O only

Pair trading strategies require you to go long on one stock and short on the other. The only way we can short a stock right now is in futures market (as far as I know). So we can execute pair trading only for stocks in F&O and that’ll be our search universe.

You can download the list from this link and save in the project folder - https://archives.nseindia.com/content/fo/fo_mktlots.csv)

2. Filter highly correlated stocks

We calculate correlation of every stock with every other stock in F&O list and then filter all correlations above a threshold. You can decide the threshold by trial and error. Higher the threshold, fewer the pairs and vice verca. So you use your judgement for this.

3. Manual filtering of the pair

If required, Filter the list further with criterias like industry, sector and co-dependancy.

So lets get started

Install dependencies

Let us install jugaad-data for downloading data, pandas for all the analytics and seaborn for plotting

!pip install pandas jugaad-data

Download the list from NSE’s website and place it in the same folder.

import csv
# Gather all symbols from CSV
with open('fo_mktlots.csv') as fp:
    reader = csv.reader(fp)
    rows = list(reader)
    # stocks start from 5th row and symbols are in 2nd column
    fno_stocks = [row[1].lstrip().rstrip() for row in rows[4:]]
print("Stocks in F&O: {}".format(len(fno_stocks)))
Stocks in F&O: 139

Download the data

Considering it has been a choppy year due to Covid, let us run the experiment for last year. We will use data from first half of the 2019 (Jan to Jun) to derive ratios and second half of the 2019 to test our hypothesis


from datetime import date
from jugaad_data import nse
import pandas as pd
from_date = date(2019,1,1)
to_date = date(2019,6,30)
dfs = []
for stock in fno_stocks:
    df = nse.stock_df(stock, from_date, to_date)
    dfs.append(df)
    print("#", end='')
print()
print(dfs[0].head())
###########################################################################################################################################
        DATE SERIES     OPEN     HIGH      LOW  PREV. CLOSE     LTP    CLOSE  \
0 2019-06-28     EQ  1324.90  1382.00  1319.30      1312.45  1365.0  1359.20   
1 2019-06-27     EQ  1310.55  1322.95  1292.60      1307.95  1313.1  1312.45   
2 2019-06-26     EQ  1320.00  1325.05  1301.05      1324.85  1305.5  1307.95   
3 2019-06-25     EQ  1345.00  1351.45  1309.95      1344.65  1328.6  1324.85   
4 2019-06-24     EQ  1353.90  1357.10  1330.05      1348.25  1345.0  1344.65   

      VWAP   52W H  52W L   VOLUME         VALUE  NO OF TRADES      SYMBOL  
0  1355.70  1406.0  910.1  1568379  2.126244e+09         61027  APOLLOHOSP  
1  1309.95  1406.0  910.1   688891  9.024130e+08         26612  APOLLOHOSP  
2  1309.64  1406.0  910.1   597977  7.831336e+08         30659  APOLLOHOSP  
3  1326.99  1406.0  910.1  1156668  1.534883e+09         54399  APOLLOHOSP  
4  1339.41  1406.0  910.1   420849  5.636875e+08         30942  APOLLOHOSP  

Data preparation

In varity article there are two methods of calculating correlation-

  1. Based on daily closing price
  2. Based on daily return series

We are going to calculate returns based on daily returns as recommended in the varsity article. I have tried the first approach as well, its output had a lot of noise, it provided high scores for seemingly uncorrelated stocks also, so use the first method with caution.

all_stock_df = pd.DataFrame()
for i, df in enumerate(dfs):
    all_stock_df[fno_stocks[i]] = (df['CLOSE'] - df['PREV. CLOSE'])/df['PREV. CLOSE']
print(all_stock_df.head())
   APOLLOHOSP  AMBUJACEM  BANDHANBNK  AUROPHARMA       BEL  FEDERALBNK  \
0    0.035620  -0.025635    0.001766   -0.006860 -0.021314   -0.001382   
1    0.003440  -0.004103    0.005140   -0.014408  0.004808    0.039253   
2   -0.012756   0.019521    0.013928    0.035851  0.004390    0.004810   
3   -0.014725   0.000698   -0.011613    0.022855 -0.007407    0.010204   
4   -0.002670   0.010101    0.005935   -0.026807  0.029148    0.000486   

       BPCL  CADILAHC  ADANIENT  CUMMINSIND  ...     WIPRO     L&TFH  \
0  0.000637  0.016187 -0.015375   -0.014892  ... -0.005672 -0.008183   
1  0.002044 -0.004395 -0.006177    0.006554  ... -0.014153  0.005630   
2 -0.004325  0.036892  0.019556    0.017295  ...  0.002979  0.020327   
3  0.032165  0.005894  0.020981    0.003311  ...  0.004931 -0.005275   
4  0.002237 -0.025526  0.006471    0.018963  ... -0.006822 -0.003940   

   NATIONALUM       PNB  POWERGRID  BAJFINANCE    RECLTD  MUTHOOTFIN  \
0   -0.017893  0.019872  -0.003852    0.010680  0.011043    0.007421   
1    0.004995 -0.011407  -0.009301    0.003928 -0.006098    0.000000   
2    0.020387  0.023346   0.039931    0.013436  0.019267    0.000938   
3    0.003067  0.003253   0.012811    0.009731  0.032734    0.040348   
4   -0.005086 -0.000650  -0.004252   -0.005749  0.007762   -0.014194   

   KOTAKBANK   SIEMENS  
0  -0.003037  0.013251  
1  -0.006038  0.032799  
2   0.004515  0.001398  
3   0.009044  0.002805  
4  -0.012888  0.006209  

[5 rows x 139 columns]

Calculate correlations

stock_corr = all_stock_df.corr()
print(stock_corr.head())
            APOLLOHOSP  AMBUJACEM  BANDHANBNK  AUROPHARMA       BEL  \
APOLLOHOSP    1.000000   0.320794    0.151532    0.024116  0.305902   
AMBUJACEM     0.320794   1.000000    0.217605    0.312233  0.437229   
BANDHANBNK    0.151532   0.217605    1.000000    0.092184  0.137484   
AUROPHARMA    0.024116   0.312233    0.092184    1.000000  0.193852   
BEL           0.305902   0.437229    0.137484    0.193852  1.000000   

            FEDERALBNK      BPCL  CADILAHC  ADANIENT  CUMMINSIND  ...  \
APOLLOHOSP    0.260664  0.205749  0.156552  0.317534    0.313665  ...   
AMBUJACEM     0.365048  0.445099  0.248823  0.347647    0.379015  ...   
BANDHANBNK    0.102062  0.218680  0.169849  0.305171    0.120858  ...   
AUROPHARMA    0.182673  0.236757  0.516735  0.186846    0.077641  ...   
BEL           0.565289  0.405377  0.245842  0.439243    0.349208  ...   

               WIPRO     L&TFH  NATIONALUM       PNB  POWERGRID  BAJFINANCE  \
APOLLOHOSP  0.072392  0.287634    0.328091  0.398197   0.044383    0.214700   
AMBUJACEM   0.020011  0.524240    0.332308  0.431475   0.326298    0.363156   
BANDHANBNK -0.078667  0.365273    0.142910  0.242020   0.105363    0.243383   
AUROPHARMA  0.066967  0.349406    0.131476  0.206744   0.083969    0.150650   
BEL        -0.061190  0.509221    0.445504  0.509757   0.260057    0.399151   

              RECLTD  MUTHOOTFIN  KOTAKBANK   SIEMENS  
APOLLOHOSP  0.189239    0.186098   0.056665  0.306106  
AMBUJACEM   0.206820    0.377478   0.253772  0.498506  
BANDHANBNK  0.103857    0.244627   0.161533  0.313044  
AUROPHARMA  0.183215    0.295931   0.033207  0.344312  
BEL         0.308352    0.415240   0.201867  0.393015  

[5 rows x 139 columns]

Filter based on threshold

I’m setting a threshold of 0.7 and I’m getting below output. You can reduce the threshold and you will see lot many pairs. One more thing to note is you’ll see pairs repeating, this is because corr(a, b) = corr(b, a).

threshold = 0.7
for stock in fno_stocks[:len(fno_stocks)]:
    k = stock_corr[(stock_corr[stock] > threshold) & (stock_corr[stock] < 1)][stock]
    if k.any():
        print(k.to_frame().T)
        print('------------------------------------\n\n')
                ACC
AMBUJACEM  0.806813
------------------------------------


      HINDPETRO       IOC
BPCL   0.798469  0.702593
------------------------------------


            JSWSTEEL
JINDALSTEL  0.722094
------------------------------------


            BAJFINANCE
BAJAJFINSV    0.702545
------------------------------------


         EQUITAS
UJJIVAN  0.73688
------------------------------------


         UJJIVAN
EQUITAS  0.73688
------------------------------------


          JINDALSTEL
JSWSTEEL    0.722094
------------------------------------


       BANKBARODA      SBIN       PNB
CANBK    0.794729  0.701325  0.803416
------------------------------------


     AMBUJACEM
ACC   0.806813
------------------------------------


           L&TFH
M&MFIN  0.732753
------------------------------------


               BPCL       IOC
HINDPETRO  0.798469  0.738071
------------------------------------


         BPCL  HINDPETRO
IOC  0.702593   0.738071
------------------------------------


          ULTRACEMCO
RAMCOCEM    0.729297
------------------------------------


               CANBK       PNB
BANKBARODA  0.794729  0.774867
------------------------------------


         CANBK
SBIN  0.701325
------------------------------------


            RAMCOCEM
ULTRACEMCO  0.729297
------------------------------------


         M&MFIN
L&TFH  0.732753
------------------------------------


        CANBK  BANKBARODA
PNB  0.803416    0.774867
------------------------------------


            BAJAJFINSV
BAJFINANCE    0.702545
------------------------------------

Analyse the output

Below are the pairs identified-

  1. ACC and AMBUJACEM
  2. BPCL, HINDPETRO and IOC
  3. JINDALSTEL and JSWSTEEL
  4. BAJFINANCE and BAJAJFINSV
  5. UJJIVAN and EQUITAS
  6. BANKBARODA, SBIN, PNB and CANBK
  7. L&TFH and M&MFIN
  8. RAMCOCEM and ULTRACEMCO

I dont see any anomaly in this particular output, but you should always validate the output before further exploration.

Next step

Go back to index

Download notebook