{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "title: Identify correlated stocks from F&O stock list\n",
    "summary: Download data for all the stocks traded in F&O, calculate the correlation matrix on close price and finally filter stocks with high correlation\n",
    "tags: [\"jugaad-data\", \"strategies\"]\n",
    "categories: [\"posts\"]\n",
    "date: 2020-08-25T02:02:57Z\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### [Go back to index](../)\n",
    "\n",
    "## Finding a needle in haystack\n",
    "There are more than 1600 companies listed in NSE, which gives you more than 1.2 million potential pairs. This is not a very helpful number, even for the computer it might become computationally challenging to handle such a big number. So we need to apply series of fileters to narrow down our search. Here's what I'm going to do-\n",
    "\n",
    "### 1. Narrow down the universe - Consider stocks traded in F&O only\n",
    "Pair trading strategies require you to go long on one stock and short on the other. The only way we can short a stock right now is in futures market (as far as I know). So we can execute pair trading only for stocks in F&O and that'll be our search universe.\n",
    "\n",
    "You can download the list from this link and save in the project folder - https://archives.nseindia.com/content/fo/fo_mktlots.csv)\n",
    "\n",
    "### 2. Filter highly correlated stocks\n",
    "We calculate correlation of every stock with every other stock in F&O list and then filter all correlations above a threshold. You can decide the threshold by trial and error. Higher the threshold, fewer the pairs and vice verca. So you use your judgement for this.\n",
    "\n",
    "### 3. Manual filtering of the pair\n",
    "If required, Filter the list further with criterias like industry, sector and co-dependancy.\n",
    "\n",
    "So lets get started\n",
    "\n",
    "## Install dependencies\n",
    "Let us install `jugaad-data` for downloading data, `pandas` for all the analytics and `seaborn` for plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pandas jugaad-data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Download the list from NSE's website](https://archives.nseindia.com/content/fo/fo_mktlots.csv) and place it in the same folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stocks in F&O: 139\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "# Gather all symbols from CSV\n",
    "with open('fo_mktlots.csv') as fp:\n",
    "    reader = csv.reader(fp)\n",
    "    rows = list(reader)\n",
    "    # stocks start from 5th row and symbols are in 2nd column\n",
    "    fno_stocks = [row[1].lstrip().rstrip() for row in rows[4:]]\n",
    "print(\"Stocks in F&O: {}\".format(len(fno_stocks)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the data\n",
    "\n",
    "Considering it has been a choppy year due to Covid, let us run the experiment for last year. We will use data from first half of the 2019 (Jan to Jun) to derive ratios and second half of the 2019 to test our hypothesis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "###########################################################################################################################################\n",
      "        DATE SERIES     OPEN     HIGH      LOW  PREV. CLOSE     LTP    CLOSE  \\\n",
      "0 2019-06-28     EQ  1324.90  1382.00  1319.30      1312.45  1365.0  1359.20   \n",
      "1 2019-06-27     EQ  1310.55  1322.95  1292.60      1307.95  1313.1  1312.45   \n",
      "2 2019-06-26     EQ  1320.00  1325.05  1301.05      1324.85  1305.5  1307.95   \n",
      "3 2019-06-25     EQ  1345.00  1351.45  1309.95      1344.65  1328.6  1324.85   \n",
      "4 2019-06-24     EQ  1353.90  1357.10  1330.05      1348.25  1345.0  1344.65   \n",
      "\n",
      "      VWAP   52W H  52W L   VOLUME         VALUE  NO OF TRADES      SYMBOL  \n",
      "0  1355.70  1406.0  910.1  1568379  2.126244e+09         61027  APOLLOHOSP  \n",
      "1  1309.95  1406.0  910.1   688891  9.024130e+08         26612  APOLLOHOSP  \n",
      "2  1309.64  1406.0  910.1   597977  7.831336e+08         30659  APOLLOHOSP  \n",
      "3  1326.99  1406.0  910.1  1156668  1.534883e+09         54399  APOLLOHOSP  \n",
      "4  1339.41  1406.0  910.1   420849  5.636875e+08         30942  APOLLOHOSP  \n"
     ]
    }
   ],
   "source": [
    "from datetime import date\n",
    "from jugaad_data import nse\n",
    "import pandas as pd\n",
    "from_date = date(2019,1,1)\n",
    "to_date = date(2019,6,30)\n",
    "dfs = []\n",
    "for stock in fno_stocks:\n",
    "    df = nse.stock_df(stock, from_date, to_date)\n",
    "    dfs.append(df)\n",
    "    print(\"#\", end='')\n",
    "print()\n",
    "print(dfs[0].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data preparation\n",
    "\n",
    "In [varity article](https://zerodha.com/varsity/chapter/pair-stats/) there are two methods of calculating correlation-\n",
    "\n",
    "1. Based on daily closing price\n",
    "2. **Based on daily return series**\n",
    "\n",
    "We are going to calculate returns based on daily returns as recommended in the varsity article. I have tried the first approach as well, its output had a lot of noise, it provided high scores for seemingly uncorrelated stocks also, so use the first method with caution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   APOLLOHOSP  AMBUJACEM  BANDHANBNK  AUROPHARMA       BEL  FEDERALBNK  \\\n",
      "0    0.035620  -0.025635    0.001766   -0.006860 -0.021314   -0.001382   \n",
      "1    0.003440  -0.004103    0.005140   -0.014408  0.004808    0.039253   \n",
      "2   -0.012756   0.019521    0.013928    0.035851  0.004390    0.004810   \n",
      "3   -0.014725   0.000698   -0.011613    0.022855 -0.007407    0.010204   \n",
      "4   -0.002670   0.010101    0.005935   -0.026807  0.029148    0.000486   \n",
      "\n",
      "       BPCL  CADILAHC  ADANIENT  CUMMINSIND  ...     WIPRO     L&TFH  \\\n",
      "0  0.000637  0.016187 -0.015375   -0.014892  ... -0.005672 -0.008183   \n",
      "1  0.002044 -0.004395 -0.006177    0.006554  ... -0.014153  0.005630   \n",
      "2 -0.004325  0.036892  0.019556    0.017295  ...  0.002979  0.020327   \n",
      "3  0.032165  0.005894  0.020981    0.003311  ...  0.004931 -0.005275   \n",
      "4  0.002237 -0.025526  0.006471    0.018963  ... -0.006822 -0.003940   \n",
      "\n",
      "   NATIONALUM       PNB  POWERGRID  BAJFINANCE    RECLTD  MUTHOOTFIN  \\\n",
      "0   -0.017893  0.019872  -0.003852    0.010680  0.011043    0.007421   \n",
      "1    0.004995 -0.011407  -0.009301    0.003928 -0.006098    0.000000   \n",
      "2    0.020387  0.023346   0.039931    0.013436  0.019267    0.000938   \n",
      "3    0.003067  0.003253   0.012811    0.009731  0.032734    0.040348   \n",
      "4   -0.005086 -0.000650  -0.004252   -0.005749  0.007762   -0.014194   \n",
      "\n",
      "   KOTAKBANK   SIEMENS  \n",
      "0  -0.003037  0.013251  \n",
      "1  -0.006038  0.032799  \n",
      "2   0.004515  0.001398  \n",
      "3   0.009044  0.002805  \n",
      "4  -0.012888  0.006209  \n",
      "\n",
      "[5 rows x 139 columns]\n"
     ]
    }
   ],
   "source": [
    "all_stock_df = pd.DataFrame()\n",
    "for i, df in enumerate(dfs):\n",
    "    all_stock_df[fno_stocks[i]] = (df['CLOSE'] - df['PREV. CLOSE'])/df['PREV. CLOSE']\n",
    "print(all_stock_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate correlations\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "            APOLLOHOSP  AMBUJACEM  BANDHANBNK  AUROPHARMA       BEL  \\\n",
      "APOLLOHOSP    1.000000   0.320794    0.151532    0.024116  0.305902   \n",
      "AMBUJACEM     0.320794   1.000000    0.217605    0.312233  0.437229   \n",
      "BANDHANBNK    0.151532   0.217605    1.000000    0.092184  0.137484   \n",
      "AUROPHARMA    0.024116   0.312233    0.092184    1.000000  0.193852   \n",
      "BEL           0.305902   0.437229    0.137484    0.193852  1.000000   \n",
      "\n",
      "            FEDERALBNK      BPCL  CADILAHC  ADANIENT  CUMMINSIND  ...  \\\n",
      "APOLLOHOSP    0.260664  0.205749  0.156552  0.317534    0.313665  ...   \n",
      "AMBUJACEM     0.365048  0.445099  0.248823  0.347647    0.379015  ...   \n",
      "BANDHANBNK    0.102062  0.218680  0.169849  0.305171    0.120858  ...   \n",
      "AUROPHARMA    0.182673  0.236757  0.516735  0.186846    0.077641  ...   \n",
      "BEL           0.565289  0.405377  0.245842  0.439243    0.349208  ...   \n",
      "\n",
      "               WIPRO     L&TFH  NATIONALUM       PNB  POWERGRID  BAJFINANCE  \\\n",
      "APOLLOHOSP  0.072392  0.287634    0.328091  0.398197   0.044383    0.214700   \n",
      "AMBUJACEM   0.020011  0.524240    0.332308  0.431475   0.326298    0.363156   \n",
      "BANDHANBNK -0.078667  0.365273    0.142910  0.242020   0.105363    0.243383   \n",
      "AUROPHARMA  0.066967  0.349406    0.131476  0.206744   0.083969    0.150650   \n",
      "BEL        -0.061190  0.509221    0.445504  0.509757   0.260057    0.399151   \n",
      "\n",
      "              RECLTD  MUTHOOTFIN  KOTAKBANK   SIEMENS  \n",
      "APOLLOHOSP  0.189239    0.186098   0.056665  0.306106  \n",
      "AMBUJACEM   0.206820    0.377478   0.253772  0.498506  \n",
      "BANDHANBNK  0.103857    0.244627   0.161533  0.313044  \n",
      "AUROPHARMA  0.183215    0.295931   0.033207  0.344312  \n",
      "BEL         0.308352    0.415240   0.201867  0.393015  \n",
      "\n",
      "[5 rows x 139 columns]\n"
     ]
    }
   ],
   "source": [
    "stock_corr = all_stock_df.corr()\n",
    "print(stock_corr.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter based on threshold\n",
    "I'm setting a threshold of 0.7 and I'm getting below output. You can reduce the threshold and you will see lot many pairs. One more thing to note is you'll see pairs repeating, this is because `corr(a, b) = corr(b, a)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                ACC\n",
      "AMBUJACEM  0.806813\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "      HINDPETRO       IOC\n",
      "BPCL   0.798469  0.702593\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "            JSWSTEEL\n",
      "JINDALSTEL  0.722094\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "            BAJFINANCE\n",
      "BAJAJFINSV    0.702545\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "         EQUITAS\n",
      "UJJIVAN  0.73688\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "         UJJIVAN\n",
      "EQUITAS  0.73688\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "          JINDALSTEL\n",
      "JSWSTEEL    0.722094\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "       BANKBARODA      SBIN       PNB\n",
      "CANBK    0.794729  0.701325  0.803416\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "     AMBUJACEM\n",
      "ACC   0.806813\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "           L&TFH\n",
      "M&MFIN  0.732753\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "               BPCL       IOC\n",
      "HINDPETRO  0.798469  0.738071\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "         BPCL  HINDPETRO\n",
      "IOC  0.702593   0.738071\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "          ULTRACEMCO\n",
      "RAMCOCEM    0.729297\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "               CANBK       PNB\n",
      "BANKBARODA  0.794729  0.774867\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "         CANBK\n",
      "SBIN  0.701325\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "            RAMCOCEM\n",
      "ULTRACEMCO  0.729297\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "         M&MFIN\n",
      "L&TFH  0.732753\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "        CANBK  BANKBARODA\n",
      "PNB  0.803416    0.774867\n",
      "------------------------------------\n",
      "\n",
      "\n",
      "            BAJAJFINSV\n",
      "BAJFINANCE    0.702545\n",
      "------------------------------------\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "threshold = 0.7\n",
    "for stock in fno_stocks[:len(fno_stocks)]:\n",
    "    k = stock_corr[(stock_corr[stock] > threshold) & (stock_corr[stock] < 1)][stock]\n",
    "    if k.any():\n",
    "        print(k.to_frame().T)\n",
    "        print('------------------------------------\\n\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyse the output\n",
    "\n",
    "Below are the pairs identified-\n",
    "\n",
    "1. ACC and AMBUJACEM\n",
    "2. BPCL, HINDPETRO and IOC\n",
    "3. JINDALSTEL and JSWSTEEL\n",
    "4. BAJFINANCE and BAJAJFINSV\n",
    "5. UJJIVAN and EQUITAS\n",
    "6. BANKBARODA, SBIN, PNB and CANBK\n",
    "7. L&TFH and M&MFIN\n",
    "8. RAMCOCEM and ULTRACEMCO\n",
    "\n",
    "I dont see any anomaly in this particular output, but you should always validate the output before further exploration.\n",
    "\n",
    "### [Next step](../pair-stats/)\n",
    "\n",
    "### [Go back to index](../)\n",
    "\n",
    "### [Download notebook](../identify-stocks.ipynb)\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}