Investment Portfolio Diversification Using Data Analysis in Python

Correlation Coefficient

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. Values range between -1.0 and 1.0: a correlation of -1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation.
There are several types of correlation coefficients, but the most common is the Pearson correlation (r), which measures the strength and direction of the linear relationship between two variables.
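As a quick illustration, Pearson's r can be computed for two return series with NumPy (the numbers below are made up purely for demonstration):

```python
import numpy as np

# Two hypothetical daily-return series (made-up values for illustration)
stock_a = np.array([0.01, -0.02, 0.015, 0.003, -0.01])
stock_b = np.array([0.012, -0.018, 0.01, 0.001, -0.008])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between the two series
r = np.corrcoef(stock_a, stock_b)[0, 1]
print(round(r, 3))
```

Since these two made-up series move almost in lockstep, r comes out close to 1.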

Project Objective

In this project I want to use historical stock price data to present a new way of looking at correlation between stocks, one that considers changes in the correlation coefficient over time. In other words, I want to use data to help investors diversify their investment portfolios by giving them a better understanding of stock price correlations over time.

Project Stages

This project passed through four stages to get to this point:
- Stage 1: Gathering Data
- Stage 2: Cleaning and Processing Data
- Stage 3: Converting Data to Pandas DataFrame
- Stage 4: Visualizing Data

Stage 1: Gathering Data

To gather the required data, I used the finnhub.io API as my data source. The API has good documentation, and I was able to connect to it, retrieve the data easily, and save it in .csv files on my computer. For calling the API, I used a list of 100 companies from the S&P 500 to test my workflow and outputs:

import finnhub
import pandas as pd

def update_dataset():
    # Set up the client for the Finnhub API
    finnhub_client = finnhub.Client(api_key="###############")

    for company in companies:
        try:
            # Fetch daily stock candles for this company
            res = finnhub_client.stock_candles(company, 'D', start_date, end_date)

            # Convert to a pandas DataFrame
            response = pd.DataFrame(res)

            # Write the DataFrame out as a CSV file
            response.to_csv(f"{company}.csv")
        except Exception:
            print(company, 'no data found ...')

Stage 2: Cleaning and Processing Data

The output of the last step is 100 CSV files saved on the computer. These files contain the stock price of each company for each day within the specified timeframe. However, what I need is the price change from day to day. For this purpose, I use the closing price. Now I have to read each file, get the closing price for each day, and calculate the change from one day to the next:

class DeltaObject:
    def __init__(self, delta_change, percent_change, dates):
        self.change = delta_change
        self.percent = percent_change
        self.dates = dates

    def __repr__(self):
        return f'Change: {self.change}, Percent: {self.percent}, Dates: {self.dates}'


def refine_companies_records():
    companies_records = {}

    # Get data for each company whose CSV file exists in the directory
    for company in companies:
        if (company + '.csv') in files:
            company_data = get_company_data(company)

            close_price_list = []
            for line in company_data:
                line = line.strip('\n').split(',')
                # Keep the closing price and its date
                date_price = [line[1], line[6]]
                close_price_list.append(date_price)

            change_list = []
            for i in range(1, len(close_price_list)):
                delta_change = float(close_price_list[i][0]) - float(close_price_list[i-1][0])
                percent_change = delta_change / float(close_price_list[i-1][0])
                dates = (close_price_list[i-1][1], close_price_list[i][1])

                change_list.append(DeltaObject(delta_change, percent_change, dates))

            companies_records[company] = change_list

    return companies_records
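For comparison, the same day-to-day change and percent change can be derived directly from a closing-price column with pandas; a minimal sketch, using made-up prices and the column name 'c' that Finnhub uses for the close:

```python
import pandas as pd

# Hypothetical closing prices for one company (made-up values)
df = pd.DataFrame({'c': [100.0, 102.0, 101.0, 103.5]})

# Day-to-day absolute change and percent change of the closing price
df['change'] = df['c'].diff()
df['percent'] = df['c'].pct_change()

print(df)
```

The first row is NaN in both new columns because there is no previous day to compare against.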

Stage 3: Converting Data to Pandas DataFrame

The goal of this project is to distill the data into a simple graph that shows the interdependencies between companies' stock prices. Before the conversion, each company's daily changes are compared against those of a selected company and mapped to a color. For a single data point, the logic looks like this:

company_data = {'A': [DeltaObject0, DeltaObject1, ...], 'B': [DeltaObject0, DeltaObject1, ...], ...}

selected_company = 'C'
date_0_c = company_data['C'][0].dates
change_0_c = company_data['C'][0].change
percent_0_c = company_data['C'][0].percent

# Data for company A
date_0_A = company_data['A'][0].dates
change_0_A = company_data['A'][0].change
percent_0_A = company_data['A'][0].percent

change = change_0_c * change_0_A     # for determining the color
percent = percent_0_A / percent_0_c  # for determining the color opacity

if change > 0:
    color = (0, 1, 0, percent)  # green with alpha = percent
else:
    color = (1, 0, 0, percent)  # red with alpha = percent
def get_compared_datapoints(company_data):
    data_points = []
    dict_of_dates = {}

    for i in company_data:
        date = i.dates
        change = i.change
        percent = i.percent

        lc = []

        for symbol in companies_records:
            company_to_compare = companies_records[symbol]

            for k in company_to_compare:
                date_k = k.dates
                change_k = k.change
                percent_k = k.percent

                # Set the color of the comparison result
                if date == date_k:
                    if (change * change_k) <= 0:
                        same_change = [1, 0, 0]  # opposite moves: show in red
                    else:
                        same_change = [0, 1, 0]  # same-direction moves: show in green

                    # Set the color opacity
                    try:
                        relational_opacity = round(abs(percent_k / percent), 2)
                        if relational_opacity >= 1:
                            relational_opacity = 1
                    except ZeroDivisionError:
                        relational_opacity = 0

                    same_change.append(relational_opacity)
                    data_point = tuple(same_change)
                    data_points.append(data_point)

                    company_date_dict = {symbol: data_point}
                    lc.append(company_date_dict)

        dict_of_dates[date] = lc

    results = dict_of_dates

    # Merge the separate dictionaries created in the last step
    for date in results:
        merged_dict = {}
        for d in results[date]:
            merged_dict.update(d)
        results[date] = merged_dict

    # Returns a nested dictionary {date: {company_1: color_tuple, ..., company_n: color_tuple}}
    return results
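The nested dictionary returned above maps each date pair to a {company: color_tuple} mapping, which pandas can turn into a DataFrame directly: outer keys become columns and inner keys become rows. A minimal sketch with made-up color tuples:

```python
import pandas as pd

# Stand-in for the output of get_compared_datapoints:
# {date_pair: {company: (r, g, b, alpha), ...}, ...} -- all values made up
results = {
    ('2021-01-04', '2021-01-05'): {'AAPL': (0, 1, 0, 0.8), 'AES': (1, 0, 0, 0.4)},
    ('2021-01-05', '2021-01-06'): {'AAPL': (1, 0, 0, 0.2), 'AES': (0, 1, 0, 1.0)},
}

# Each row is a company, each column a date pair, each cell a color tuple
data_frame = pd.DataFrame(results)
print(data_frame)
```

Each row of this DataFrame is one company's sequence of comparison colors, which is exactly the shape the plotting step consumes.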

Stage 4: Visualizing Data

Unfortunately, the existing options in the matplotlib library do not offer a simple solution for the type of graph I am looking for. The reason is that typical data visualization graphs are designed to represent changes in one value against changes in another, for example, changes in price versus changes in time. These two values can easily be measured against each other. However, the graph I am looking for has three dimensions:

def plot_graph():
    selection = int(input("Number of companies to compare: "))

    x_companies = companies[:selection]
    y_date = 1

    colors_data = data_frame.values[:selection]

    # Plot each day pair as a stacked bar segment, colored by the comparison result
    for j in range(colors_data.shape[1]):
        for i in range(selection):
            plt.bar(x_companies[i], y_date, bottom=j, color=colors_data[i][j])

    # Set plot specifications
    plt.title(f"Correlation with {company}'s stock price change")
    plt.xticks(x_companies, rotation=90)
    plt.ylabel(f"From {start_date} to {till_date}")
    plt.xlabel("Companies")

Interpreting the Graph

For diversification, we look for investments that move opposite to our main choices. In the graph above, for example, if you are going to invest in Apple stock, then to diversify your portfolio you should look for a company whose bar is more red than green. Red shows that the two stocks move in opposite directions. In this graph, AES and AAP seem to be better choices for diversification.

Reflection

Dealing with large datasets and processing them for a specific use is really time-consuming and mentally intensive. I am happy with the result of this project so far, but I believe the following steps are necessary to make this program more usable:

  • Since I convert a dictionary to a DataFrame, there is no sorting rule for companies that the DataFrame understands, so when I select 10 companies to compare against, they appear in no particular order in the DataFrame. I have to define a process for sorting the companies in the DataFrame from largest to smallest.
  • There is a bottleneck in this program at the get_compared_datapoints function. It takes too much time when the data files are large and the selected period of time is long. I have to find a way to increase the efficiency and speed of the program.
  • I think that to produce an insightful output, each company must be compared with at least the top 100 companies in the database. However, it is not practical to put all the data on a single plot. Using subplots and dividing the output into 10 plots of 10 companies each would be more understandable for users. As a next step, I have to explore the use of subplots.
  • Though visualization conveys a lot of information at a quick glance, this data analysis is more useful when accompanied by numerical values. For example, having the percentage of green data points for each company would support a more in-depth study of the graph.
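The last point above can be sketched with a small helper that, given one company's list of (r, g, b, alpha) color tuples, reports the share of green (same-direction) data points. The helper name and sample tuples are hypothetical:

```python
def green_fraction(color_tuples):
    """Fraction of data points where the two stocks moved in the same direction.

    Green points are (0, 1, 0, alpha); red points are (1, 0, 0, alpha).
    """
    if not color_tuples:
        return 0.0
    greens = sum(1 for (r, g, b, a) in color_tuples if g == 1)
    return greens / len(color_tuples)

# Made-up sample: three green days, one red day
sample = [(0, 1, 0, 0.5), (1, 0, 0, 0.3), (0, 1, 0, 1.0), (0, 1, 0, 0.2)]
print(green_fraction(sample))  # → 0.75
```

A company with a low green fraction relative to the selected stock would be a stronger diversification candidate.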
