Investment Portfolio Diversification Using Data Analysis in Python

Hesam Andalib
9 min read · Dec 22, 2020


This project is a short sprint in data analysis with stock market datasets, exploring how we can make better decisions about an investment portfolio without going too deep into the fundamental interconnections between companies. You can download this project, or read it in a more extensive form, on my GitHub:

Modern Portfolio Theory (MPT) stresses that investors should look for a consistently uncorrelated pool of assets to limit risk; in practical terms, that means building a diversified portfolio. The correlation coefficient is therefore a central component of MPT.

Correlation statistics are widely used in finance and investing. The correlation coefficient has limited power to predict returns for individual stocks. Still, it has real value in describing the extent to which two stocks move in relation to each other, because it measures both the degree to which two stocks move in tandem and the strength of that relationship.

Correlation Coefficient

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. Values range between -1.0 and 1.0: a correlation of -1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation.
There are several types of correlation coefficients, but the most common is the Pearson correlation (r), which measures the strength and direction of the linear relationship between two variables.
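As a concrete illustration (using made-up return series, not data from this project), the Pearson coefficient can be computed in a few lines with NumPy:

```python
import numpy as np

# Two hypothetical daily-return series that mostly move together
a = np.array([0.010, -0.020, 0.015, 0.003, -0.007])
b = np.array([0.012, -0.018, 0.010, 0.001, -0.009])

# Pearson correlation coefficient r, read off the 2x2 correlation matrix
r = np.corrcoef(a, b)[0, 1]
print(round(r, 3))  # close to 1.0, since the series move in the same direction
```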

One of the main criticisms of MPT is that it assumes the correlation between assets is static over time. In reality, correlations often shift, especially during periods of higher volatility.
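This criticism is easy to demonstrate on synthetic data: with pandas, a rolling-window Pearson correlation shows how much the measured relationship between two (here randomly generated) return series drifts over time, even though the underlying coupling is fixed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily returns: y is only loosely coupled to x
x = pd.Series(rng.normal(0, 0.01, 250))
y = 0.5 * x + pd.Series(rng.normal(0, 0.01, 250))

# A static correlation gives one number; a 60-day rolling window shows the drift
static_r = x.corr(y)
rolling_r = x.rolling(60).corr(y)
print(round(static_r, 2), round(rolling_r.min(), 2), round(rolling_r.max(), 2))
```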

Project Objective

In this project, I want to use historical stock price data to present a new way of looking at the correlation between stocks, one that considers changes in the correlation coefficient over time. In other words, I want to use data to help investors diversify their investment portfolios through a better understanding of how stock price correlations evolve over time.

By tracking the changes in a company's stock price and comparing them with those of other companies, I hope to gain insights into relationships between companies that cannot easily be found by following the news or studying static correlation coefficient tables.

Project Stages

This project passed through four stages to get to this point:
- Stage 1: Gathering Data
- Stage 2: Cleaning and Processing Data
- Stage 3: Converting Data to Pandas DataFrame
- Stage 4: Visualizing Data

Stage 1: Gathering Data

For gathering the required data, I used the Finnhub API as the data source. The API has good documentation, so I was able to connect, fetch the data easily, and save it as .csv files on my computer. For the API calls, I used a list of 100 companies from the S&P 500 to test my workflow and outputs:

def update_dataset():

    import finnhub
    import pandas as pd

    # Set up a client for the Finnhub API
    finnhub_client = finnhub.Client(api_key="###############")

    for company in companies:
        try:
            # Daily ('D') stock candles for the requested date range
            res = finnhub_client.stock_candles(company, 'D', start_date, end_date)

            # Convert the response to a Pandas DataFrame
            response = pd.DataFrame(res)

            # Write the DataFrame out as a CSV file
            response.to_csv(f"{company}.csv")
        except Exception:
            print(company, 'no data found ...')

Stage 2: Cleaning and Processing Data

The output of the last step is 100 CSV files saved on the computer. These files contain each company's stock price for every day within the specified timeframe. However, what I need is the price change from day to day, and for this purpose I use the 'closing' price. So I have to read each file, get the closing price for each day, and calculate

(c_day(n) - c_day(n-1)), where 'c' is the closing price and 'n' indexes the trading day.

For the purpose of this project, I need 3 data points for my calculations:
- Price change in each pair of dates
- Percent of change in each pair of dates
- A pair of consecutive dates
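For reference, pandas can produce the first two of these directly. A sketch on a toy closing-price series (not the project's data):

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 103.5])

delta = close.diff()          # c(n) - c(n-1), the day-to-day price change
percent = close.pct_change()  # delta / c(n-1), the percent change

print(delta.tolist()[1:])     # [2.0, -1.0, 2.5]
```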

I defined a data structure named DeltaObject to store these data points:

class DeltaObject:
    def __init__(self, delta_change, percent_change, dates):
        self.change = delta_change
        self.percent = percent_change
        self.dates = dates

    def __repr__(self):
        return f'Change: {self.change}, Percent: {self.percent}, Dates: {self.dates}'

Now I can create a dictionary in which each company is a key and the value for that key is the list of all DeltaObject instances for that company.

def refine_companies_records():

    companies_records = {}

    # Get data for each company whose CSV file exists in the directory
    for company in companies:
        if (company + '.csv') not in files:
            continue
        company_data = get_company_data(company)

        close_price_list = []
        for line in company_data:
            line = line.strip('\n').split(',')
            # Keep (closing price, date) for each trading day; the column
            # positions follow the layout of the CSV files saved in Stage 1
            close_price_list.append((line[1], line[6]))

        change_list = []
        for i in range(1, len(close_price_list)):
            delta_change = float(close_price_list[i][0]) - float(close_price_list[i-1][0])
            percent_change = delta_change / float(close_price_list[i-1][0])
            dates = (close_price_list[i-1][1], close_price_list[i][1])
            change_list.append(DeltaObject(delta_change, percent_change, dates))

        companies_records[company] = change_list

    return companies_records

Stage 3: Converting data to Pandas DataFrame

The goal of this project is to distill the data into a simple graph that shows the interdependencies between companies’ stock prices.

Up to this point, I have a dictionary named companies_records that contains all the price change information for each pair of consecutive dates in the specified timeframe.

The objective is to get a graph like this:

This graph compares companies W, X, Y, and Z with company A over the timeframe D1-D5. If the stock price changes of company A move in the same direction as those of company X (i.e., both increase or both decrease), we should see a green square; if they move in opposite directions (e.g., A increases while the other decreases), we should see a red square. The opacity of the color shows the extent to which the changes are the same or opposite.

To get such an output, I found that I need to reshape my data into a dictionary in which keys are date pairs and the value for each date pair is a color tuple. Since I am going to use the matplotlib library for plotting, I can specify a color as a tuple of (red, green, blue, alpha). Because I only want two colors with varying opacities, each data point must use one of these:

(1,0,0,alpha) for varieties of red

(0,1,0,alpha) for varieties of green

Therefore, for each DeltaObject stored in the company_data dictionary, I can define a color tuple. For example, to get the first colored square comparing company 'C' with company 'A', we do the following:

company_data = {'A': [DeltaObject0, DeltaObject1, ...], 'B': [DeltaObject0, DeltaObject1, ...], ...}

selected_company = 'C'
date_0_c = company_data['C'][0].dates
change_0_c = company_data['C'][0].change
percent_0_c = company_data['C'][0].percent

# Data for company A
date_0_A = company_data['A'][0].dates
change_0_A = company_data['A'][0].change
percent_0_A = company_data['A'][0].percent

change = change_0_c * change_0_A     # sign determines the color
percent = percent_0_A / percent_0_c  # magnitude determines the color opacity

if change > 0:
    color = (0, 1, 0, percent)  # green with alpha = percent
else:
    color = (1, 0, 0, percent)  # red with alpha = percent
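One caveat with the snippet above: the ratio of the two percent changes can exceed 1 or be negative, which is not a valid alpha value for matplotlib (alpha must lie in [0, 1]). A minimal sketch of the clamping the full function applies, using a hypothetical ratio:

```python
# Hypothetical ratio of percent changes; can fall outside the valid alpha range
percent = -1.53

# Clamp to [0, 1] before using it as the alpha channel of a color tuple
alpha = min(abs(percent), 1.0)
color = (1, 0, 0, alpha)  # red with a valid opacity

print(color)  # (1, 0, 0, 1.0)
```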

Keeping the idea explained above, the following function calculates all the data points in the time range specified by the user:

def get_compared_datapoints(company_data):

    dict_of_dates = {}

    for i in company_data:
        date = i.dates
        change = i.change
        percent = i.percent

        list_of_comparisons = []
        for symbol in companies_records:
            company_to_compare = companies_records[symbol]

            for k in company_to_compare:
                date_k = k.dates
                change_k = k.change
                percent_k = k.percent

                # Only compare records that cover the same pair of dates
                if date != date_k:
                    continue

                # Set the color of the comparison result
                if (change * change_k) <= 0:
                    same_change = [1, 0, 0]  # opposite moves: show in red
                else:
                    same_change = [0, 1, 0]  # same-direction moves: show in green

                # Set the color opacity, clamped to the valid alpha range [0, 1]
                if percent == 0:
                    relational_opacity = 0
                else:
                    relational_opacity = round(abs(percent_k / percent), 2)
                    if relational_opacity >= 1:
                        relational_opacity = 1

                same_change.append(relational_opacity)
                data_point = tuple(same_change)

                list_of_comparisons.append({symbol: data_point})

        dict_of_dates[date] = list_of_comparisons

    results = dict_of_dates

    # Merge the separate one-company dictionaries created in the last step
    for date in results:
        merged_dict = {}
        for d in results[date]:
            merged_dict.update(d)
        results[date] = merged_dict

    # Returns a nested dictionary {date: {company_1: color_tuple, ..., company_n: color_tuple}}
    return results
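The nested dictionary can then be turned into the DataFrame used in the plotting stage. A sketch with hypothetical output (the date pairs and company symbols are made up): by default, pd.DataFrame treats the outer keys as columns, so date pairs become columns and companies become rows, which matches the row-per-company indexing used later in the plotting code.

```python
import pandas as pd

# Hypothetical output of get_compared_datapoints
results = {
    ('2019-01-02', '2019-01-03'): {'AAPL': (0, 1, 0, 0.8), 'MSFT': (1, 0, 0, 0.3)},
    ('2019-01-03', '2019-01-04'): {'AAPL': (1, 0, 0, 0.5), 'MSFT': (0, 1, 0, 1.0)},
}

# Outer keys (date pairs) become columns; inner keys (companies) become rows
data_frame = pd.DataFrame(results)
print(data_frame.shape)  # (2, 2): one row per company, one column per date pair
```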

Stage 4: Visualizing Data

Unfortunately, the existing options in the matplotlib library do not offer a simple solution for the type of graph I am looking for. The reason is that the usual data visualization graphs are designed to represent changes in one value associated with changes in another value, for example changes in price versus changes in time; those two values can easily be measured against each other. However, the graph I am looking for has 3 dimensions:

- On one axis I want to have name of the company
- On another axis I want to have a pair of consecutive dates
- The intersection of a company with a pair of dates shows only a colored square

For this visualization, I explored the scatter plot and the bar chart as options I could adapt to get the final output. The good thing about these plots is that matplotlib allows drawing multiple times on a single figure; this way, I can plot the data points for one pair of dates and draw the next row of data on top of it. The bar chart has the added advantage of letting me specify the bottom of each bar. I was able to achieve my goal using the following code:

def plot_graph():

    selection = int(input("Number of companies to compare: "))

    x_companies = companies[:selection]
    y_date = 1

    colors_data = data_frame.values[:selection]

    # Plot the bar chart: one unit-height bar segment per (company, date pair),
    # stacked by setting `bottom` to the row index of the date pair
    for j in range(colors_data.shape[1]):
        for i in range(selection):
            plt.bar(x_companies[i], y_date, bottom=j, color=colors_data[i][j])

    # Set plot specifications
    plt.title(f"Correlation with {company}'s stock price change")
    plt.xticks(rotation=90)
    plt.ylabel(f"From {start_date} to {till_date}")

For example, if you want to see the correlation between the stock price changes of 10 companies and Apple's, from 01/01/2019 to 01/01/2020, you will get a graph like this:

Interpreting the Graph

For diversification, we look for investments that move opposite to our main choices. In the graph above, for example, if you are going to invest in Apple stock, then to diversify your portfolio you should look for a company whose bar is mostly red rather than green: red shows that its stock price tends to move opposite to Apple's. In this graph, AES and AAP seem to be better choices for diversification.


Dealing with large datasets and processing them for a specific use is time consuming and mentally intensive. I am happy with the result of this project so far, but I believe the following steps are necessary to make this program more usable:

  • Since I convert a dictionary to a DataFrame, there is no sorting rule for companies that the DataFrame understands, so when I choose to compare a company with 10 others, the selected companies appear in no particular order. I have to define a process for sorting the companies in the DataFrame, for example from largest to smallest.
  • There is a bottleneck in this program at the get_compared_datapoints function. It takes too much time when the data files are large and the selected time period is long. I have to find a way to increase the efficiency and speed of the program.
  • I think that for an insightful output, each company should be compared with at least the top 100 companies in the database. However, it is not practical to put all of that data in a single plot. Using subplots and dividing the output into 10 plots of 10 companies each would be more understandable for users. As a next step, I have to explore the use of subplots.
  • Though visualization conveys a lot of information at a quick glance, this analysis is more useful when accompanied by numerical values. For example, the percentage of green data points for each company would support a more in-depth study of the graph.
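As a sketch of that last idea (with a hypothetical two-company frame; the real one would come from the comparison step), the share of green squares per company can be computed directly from the color tuples:

```python
import pandas as pd

# Hypothetical comparison frame: rows are companies, columns are date pairs,
# and each cell is an (r, g, b, alpha) color tuple
frame = pd.DataFrame(
    [[(0, 1, 0, 0.5), (1, 0, 0, 0.2)],
     [(1, 0, 0, 0.9), (1, 0, 0, 0.4)]],
    index=['AES', 'AAP'],
)

# A square is green when its G channel is 1; average that flag across each row
green_share = frame.apply(lambda row: sum(c[1] == 1 for c in row) / len(row), axis=1)
print(green_share.to_dict())  # {'AES': 0.5, 'AAP': 0.0}
```

A low green share (mostly red squares) flags a candidate for diversification.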