There are more than 1,600 companies listed on the NSE, which gives you more than 1.2 million potential pairs (a quick check of this count follows the steps below). That is not a very helpful number; even a computer will find it challenging to crunch through that many combinations. So we need to apply a series of filters to narrow down our search. Here's what I'm going to do:
Pair trading strategies require you to go long on one stock and short on the other. The only way we can short a stock right now is in the futures market (as far as I know). So we can execute pair trading only for stocks in F&O, and that will be our search universe.
You can download the list from this link and save it in the project folder: https://archives.nseindia.com/content/fo/fo_mktlots.csv
We calculate the correlation of every stock with every other stock in the F&O list and then keep all pairs whose correlation is above a threshold. You can decide the threshold by trial and error. The higher the threshold, the fewer the pairs, and vice versa. So use your judgement for this.
If required, filter the list further with criteria like industry, sector and co-dependency.
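Before that, here is the quick pair-count check promised above. It is plain n-choose-2 arithmetic and nothing specific to NSE data:

from math import comb

# Unordered pairs possible from roughly 1600 listed companies
print(comb(1600, 2))   # 1279200, i.e. "more than 1.2 million" potential pairs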
So let's get started. Let us install jugaad-data for downloading data, pandas for all the analytics, and seaborn for plotting.
!pip install pandas seaborn jugaad-data
Download the list from NSE’s website and place it in the same folder.
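If you would rather fetch the file from the script itself, here is a minimal sketch; it assumes the archive URL mentioned earlier is still live, and it sets a browser-like User-Agent because NSE's servers sometimes reject bare requests:

from urllib.request import Request, urlopen

# Download fo_mktlots.csv into the project folder
url = "https://archives.nseindia.com/content/fo/fo_mktlots.csv"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urlopen(req) as resp, open("fo_mktlots.csv", "wb") as out:
    out.write(resp.read())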
import csv

# Gather all symbols from the CSV
with open('fo_mktlots.csv') as fp:
    reader = csv.reader(fp)
    rows = list(reader)

# Stocks start from the 5th row and symbols are in the 2nd column
fno_stocks = [row[1].strip() for row in rows[4:]]
print("Stocks in F&O: {}".format(len(fno_stocks)))
Stocks in F&O: 139
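For reference, the same parsing can be done with pandas (which we install anyway). This is a sketch that relies on the same assumption as above, i.e. symbols sit in the 2nd column starting from the 5th row:

import pandas as pd

# Skip the first 4 rows, read without a header row, take the 2nd column as symbols
lots = pd.read_csv("fo_mktlots.csv", header=None, skiprows=4)
fno_stocks = lots[1].str.strip().tolist()
print("Stocks in F&O: {}".format(len(fno_stocks)))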
Considering it has been a choppy year due to Covid, let us run the experiment on last year's data. We will use data from the first half of 2019 (Jan to Jun) to derive the ratios and the second half of 2019 to test our hypothesis.
from datetime import date
from jugaad_data import nse
import pandas as pd

from_date = date(2019, 1, 1)
to_date = date(2019, 6, 30)

dfs = []
# Download historical price data for every F&O stock; print '#' as a progress marker
for stock in fno_stocks:
    df = nse.stock_df(stock, from_date, to_date)
    dfs.append(df)
    print("#", end='')
print()
print(dfs[0].head())
###########################################################################################################################################
DATE SERIES OPEN HIGH LOW PREV. CLOSE LTP CLOSE \
0 2019-06-28 EQ 1324.90 1382.00 1319.30 1312.45 1365.0 1359.20
1 2019-06-27 EQ 1310.55 1322.95 1292.60 1307.95 1313.1 1312.45
2 2019-06-26 EQ 1320.00 1325.05 1301.05 1324.85 1305.5 1307.95
3 2019-06-25 EQ 1345.00 1351.45 1309.95 1344.65 1328.6 1324.85
4 2019-06-24 EQ 1353.90 1357.10 1330.05 1348.25 1345.0 1344.65
VWAP 52W H 52W L VOLUME VALUE NO OF TRADES SYMBOL
0 1355.70 1406.0 910.1 1568379 2.126244e+09 61027 APOLLOHOSP
1 1309.95 1406.0 910.1 688891 9.024130e+08 26612 APOLLOHOSP
2 1309.64 1406.0 910.1 597977 7.831336e+08 30659 APOLLOHOSP
3 1326.99 1406.0 910.1 1156668 1.534883e+09 54399 APOLLOHOSP
4 1339.41 1406.0 910.1 420849 5.636875e+08 30942 APOLLOHOSP
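One practical note: downloading ~139 symbols in one go occasionally fails for an individual stock (network hiccups, renamed tickers and so on). A defensive variant of the same loop, sketched below, skips the failures and keeps fno_stocks in sync with dfs so the later steps still line up:

dfs = []
failed = []
for stock in fno_stocks:
    try:
        dfs.append(nse.stock_df(stock, from_date, to_date))
        print("#", end='')
    except Exception:
        # Catching broadly here is a simplification; inspect the actual error in practice
        failed.append(stock)
        print("x", end='')
print()

# Drop failed symbols so fno_stocks[i] still matches dfs[i] in the next steps
fno_stocks = [s for s in fno_stocks if s not in failed]
print("Failed downloads:", failed)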
In the Varsity article there are two methods of calculating correlation. We are going to calculate correlation on daily returns, as recommended in the Varsity article. I have tried the first approach as well; its output had a lot of noise and it gave high scores even for seemingly uncorrelated stocks, so use the first method with caution.
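As a quick sanity check of the daily-return formula, take the first APOLLOHOSP row printed above, where CLOSE is 1359.20 and PREV. CLOSE is 1312.45:

# Daily return = (CLOSE - PREV. CLOSE) / PREV. CLOSE
print((1359.20 - 1312.45) / 1312.45)   # ~0.03562, the first APOLLOHOSP value below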
all_stock_df = pd.DataFrame()
for i, df in enumerate(dfs):
    all_stock_df[fno_stocks[i]] = (df['CLOSE'] - df['PREV. CLOSE']) / df['PREV. CLOSE']
print(all_stock_df.head())
APOLLOHOSP AMBUJACEM BANDHANBNK AUROPHARMA BEL FEDERALBNK \
0 0.035620 -0.025635 0.001766 -0.006860 -0.021314 -0.001382
1 0.003440 -0.004103 0.005140 -0.014408 0.004808 0.039253
2 -0.012756 0.019521 0.013928 0.035851 0.004390 0.004810
3 -0.014725 0.000698 -0.011613 0.022855 -0.007407 0.010204
4 -0.002670 0.010101 0.005935 -0.026807 0.029148 0.000486
BPCL CADILAHC ADANIENT CUMMINSIND ... WIPRO L&TFH \
0 0.000637 0.016187 -0.015375 -0.014892 ... -0.005672 -0.008183
1 0.002044 -0.004395 -0.006177 0.006554 ... -0.014153 0.005630
2 -0.004325 0.036892 0.019556 0.017295 ... 0.002979 0.020327
3 0.032165 0.005894 0.020981 0.003311 ... 0.004931 -0.005275
4 0.002237 -0.025526 0.006471 0.018963 ... -0.006822 -0.003940
NATIONALUM PNB POWERGRID BAJFINANCE RECLTD MUTHOOTFIN \
0 -0.017893 0.019872 -0.003852 0.010680 0.011043 0.007421
1 0.004995 -0.011407 -0.009301 0.003928 -0.006098 0.000000
2 0.020387 0.023346 0.039931 0.013436 0.019267 0.000938
3 0.003067 0.003253 0.012811 0.009731 0.032734 0.040348
4 -0.005086 -0.000650 -0.004252 -0.005749 0.007762 -0.014194
KOTAKBANK SIEMENS
0 -0.003037 0.013251
1 -0.006038 0.032799
2 0.004515 0.001398
3 0.009044 0.002805
4 -0.012888 0.006209
[5 rows x 139 columns]
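A caveat worth noting: the loop above lines the columns up by row position, not by date. That is fine as long as every symbol has exactly the same set of trading sessions in the window; if you want to be safe, a date-indexed variant of the same computation (a sketch) lets pandas align the rows by trading day:

# Build each return series with DATE as its index, then let pandas align by date
returns = {}
for i, df in enumerate(dfs):
    daily_ret = (df['CLOSE'] - df['PREV. CLOSE']) / df['PREV. CLOSE']
    returns[fno_stocks[i]] = pd.Series(daily_ret.values, index=df['DATE'].values)
all_stock_df = pd.DataFrame(returns)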
stock_corr = all_stock_df.corr()
print(stock_corr.head())
APOLLOHOSP AMBUJACEM BANDHANBNK AUROPHARMA BEL \
APOLLOHOSP 1.000000 0.320794 0.151532 0.024116 0.305902
AMBUJACEM 0.320794 1.000000 0.217605 0.312233 0.437229
BANDHANBNK 0.151532 0.217605 1.000000 0.092184 0.137484
AUROPHARMA 0.024116 0.312233 0.092184 1.000000 0.193852
BEL 0.305902 0.437229 0.137484 0.193852 1.000000
FEDERALBNK BPCL CADILAHC ADANIENT CUMMINSIND ... \
APOLLOHOSP 0.260664 0.205749 0.156552 0.317534 0.313665 ...
AMBUJACEM 0.365048 0.445099 0.248823 0.347647 0.379015 ...
BANDHANBNK 0.102062 0.218680 0.169849 0.305171 0.120858 ...
AUROPHARMA 0.182673 0.236757 0.516735 0.186846 0.077641 ...
BEL 0.565289 0.405377 0.245842 0.439243 0.349208 ...
WIPRO L&TFH NATIONALUM PNB POWERGRID BAJFINANCE \
APOLLOHOSP 0.072392 0.287634 0.328091 0.398197 0.044383 0.214700
AMBUJACEM 0.020011 0.524240 0.332308 0.431475 0.326298 0.363156
BANDHANBNK -0.078667 0.365273 0.142910 0.242020 0.105363 0.243383
AUROPHARMA 0.066967 0.349406 0.131476 0.206744 0.083969 0.150650
BEL -0.061190 0.509221 0.445504 0.509757 0.260057 0.399151
RECLTD MUTHOOTFIN KOTAKBANK SIEMENS
APOLLOHOSP 0.189239 0.186098 0.056665 0.306106
AMBUJACEM 0.206820 0.377478 0.253772 0.498506
BANDHANBNK 0.103857 0.244627 0.161533 0.313044
AUROPHARMA 0.183215 0.295931 0.033207 0.344312
BEL 0.308352 0.415240 0.201867 0.393015
[5 rows x 139 columns]
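A side note: .corr() computes the Pearson correlation by default. pandas also supports rank-based correlation if you want something less sensitive to outliers; swapping it in is a one-liner:

# Optional: Spearman rank correlation instead of the default Pearson
stock_corr_spearman = all_stock_df.corr(method='spearman')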
I'm setting a threshold of 0.7 and getting the output below. If you reduce the threshold, you will see many more pairs. One more thing to note: you'll see pairs repeated, because corr(a, b) = corr(b, a).
threshold = 0.7
for stock in fno_stocks:
    k = stock_corr[(stock_corr[stock] > threshold) & (stock_corr[stock] < 1)][stock]
    if k.any():
        print(k.to_frame().T)
        print('------------------------------------\n\n')
ACC
AMBUJACEM 0.806813
------------------------------------
HINDPETRO IOC
BPCL 0.798469 0.702593
------------------------------------
JSWSTEEL
JINDALSTEL 0.722094
------------------------------------
BAJFINANCE
BAJAJFINSV 0.702545
------------------------------------
EQUITAS
UJJIVAN 0.73688
------------------------------------
UJJIVAN
EQUITAS 0.73688
------------------------------------
JINDALSTEL
JSWSTEEL 0.722094
------------------------------------
BANKBARODA SBIN PNB
CANBK 0.794729 0.701325 0.803416
------------------------------------
AMBUJACEM
ACC 0.806813
------------------------------------
L&TFH
M&MFIN 0.732753
------------------------------------
BPCL IOC
HINDPETRO 0.798469 0.738071
------------------------------------
BPCL HINDPETRO
IOC 0.702593 0.738071
------------------------------------
ULTRACEMCO
RAMCOCEM 0.729297
------------------------------------
CANBK PNB
BANKBARODA 0.794729 0.774867
------------------------------------
CANBK
SBIN 0.701325
------------------------------------
RAMCOCEM
ULTRACEMCO 0.729297
------------------------------------
M&MFIN
L&TFH 0.732753
------------------------------------
CANBK BANKBARODA
PNB 0.803416 0.774867
------------------------------------
BAJAJFINSV
BAJFINANCE 0.702545
------------------------------------
Here are the 13 unique pairs identified: ACC-AMBUJACEM, BPCL-HINDPETRO, BPCL-IOC, HINDPETRO-IOC, JINDALSTEL-JSWSTEEL, BAJFINANCE-BAJAJFINSV, EQUITAS-UJJIVAN, CANBK-BANKBARODA, CANBK-PNB, CANBK-SBIN, BANKBARODA-PNB, L&TFH-M&MFIN and ULTRACEMCO-RAMCOCEM.
I don't see any anomaly in this particular output, but you should always validate the output before further exploration.
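The same unique pairs can also be pulled out programmatically instead of eyeballing the duplicated output. Here is a small sketch that walks only the upper triangle of the correlation matrix, reusing stock_corr and threshold from the steps above, so each pair is reported exactly once:

# List each correlated pair exactly once by scanning the upper triangle
pairs = []
cols = stock_corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        c = stock_corr.iloc[i, j]
        if c > threshold:
            pairs.append((cols[i], cols[j], round(c, 4)))

# Print pairs sorted by correlation, strongest first
for a, b, c in sorted(pairs, key=lambda p: -p[2]):
    print(a, b, c)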