A BI Tool for Collecting Online Financial Information using NLP (Use case: Amazon)
April 14, 2022
In this article, you will learn an end-to-end web scraping and NLP process to gather financial business intelligence about a given company via sentiment analysis and keyword extraction. We will use Amazon as an example use case to scrape financial news and discover business events and sentiment tone of the organization:
- This tool will mine the web via three APIs (NEWS API, FINVIZ, and GoogleNews) to extract and gather financial news about a target company on a specific day.
  - Tools used: urllib.request, BeautifulSoup, regex
- It will then do sentiment analysis on news headlines to predict the tone (percent of positive, negative, and neutral).
  - Tools used: nltk.sentiment.vader, gensim.parsing.preprocessing, nltk.stem
- Lastly, it will extract important company events from the news texts that could impact the stock price, e.g. new product launches, M&A, stock buybacks or splits, and increases or decreases in hiring.
  - Tools used: KeywordProcessor from the flashtext library
- This tool is deployed in Streamlit, and the link is provided below (the Streamlit code will not be discussed as part of this article).
- Streamlit app: https://share.streamlit.io/samfaar/bi-financial-app/main
- GitHub page: https://github.com/samfaar/BI-Financial-App
Web Scraping via FINVIZ, NEWS API, and GoogleNews
Google does not like being scraped, mainly because Google Search is itself a mighty web scraper. As a result, Google has mechanisms that limit scraping of its search results. For example, Python code that scrapes Google search results today will break whenever Google changes the CSS classes used on its results pages. When that happens, you need to view the source of the page, inspect the elements and tags you are trying to parse, and update the CSS identifiers accordingly. I was able to scrape news both through the GoogleNews package (https://pypi.org/project/GoogleNews/) and by sending requests to https://www.google.com/search?q={keyword}, but both approaches stopped working after a few days. For this reason, we will perform our news scraping with FINVIZ (http://finviz.com) and NEWS API (https://newsapi.org/docs/client-libraries/python) to get consistent results, and we will also add GoogleNews inside a try/except block to handle possible exceptions.
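The fallback idea can be sketched independently of any particular news source. The helper below is a hypothetical illustration only (the names collect_from_sources, steady_source, and flaky_source are not part of the tool): it calls each scraper in turn, keeps whatever succeeds, and records failures instead of letting one broken source stop the whole run.

```python
from typing import Callable, Dict, List, Tuple


def collect_from_sources(
    sources: Dict[str, Callable[[], List[dict]]]
) -> Tuple[List[dict], Dict[str, str]]:
    """Call every scraper and keep the rows from those that succeed."""
    rows: List[dict] = []
    errors: Dict[str, str] = {}
    for name, fetch in sources.items():
        try:
            rows.extend(fetch())
        except Exception as exc:  # a broken source must not stop the others
            errors[name] = repr(exc)
    return rows, errors


# Stand-in scrapers used only for this illustration.
def steady_source() -> List[dict]:
    return [{"news_headline": "Example headline", "url": "https://example.com/a"}]


def flaky_source() -> List[dict]:
    raise ConnectionError("blocked by remote host")


rows, errors = collect_from_sources({"steady": steady_source, "flaky": flaky_source})
print(len(rows), list(errors))
```

Recording the error message, rather than silently passing, makes it easier to notice when a source has stopped working for good.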
FINVIZ
Why news from FINVIZ? FINVIZ has a list of trusted websites, and headlines from these sites tend to be more consistent in their jargon than those from independent bloggers. Consistent textual patterns will improve the sentiment analysis scores.
The code below shows how we connect to the FINVIZ quote URL using Request and extract the news for a given company ticker symbol (AMZN for Amazon in our example) as a data frame called “news”.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

# Let's pick a company ticker symbol (AMZN for Amazon)
company_ticker = 'AMZN'

# Add the ticker symbol to the "finviz" search box url
url = ("http://finviz.com/quote.ashx?t=" + company_ticker.lower())

# Most websites block requests that come without a User-Agent header (the header simulates a typical browser)
# Build a Request for the url with such a header
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# open and read the request
webpage = urlopen(req).read()

# make a soup using BeautifulSoup from webpage
html = soup(webpage, "html.parser")

# Extract the table with 'class' = 'fullview-news-outer' from our html code, and create a dataframe from it
news = pd.read_html(str(html), attrs={'class': 'fullview-news-outer'})[0]

# extract the link of each news item by finding all the "a" tags with 'class' = 'tab-link-news'
links = []
for a in html.find_all('a', class_="tab-link-news"):
    links.append(a['href'])

# Clean up our news dataframe
news.columns = ['Date', 'News_Headline']
news['Article_Link'] = links
news.head()
As you can see above, the “Date” column has two formats: date and time, or time only. This is because FINVIZ shows only the time when the date is the same as in the previous row and does not repeat the date. We will clean this up using regular expressions (regex).
import re

# extract the time from cells that hold both a date and a time
news['time'] = news['Date'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z]{1,9}-\d{1,2}-\d{1,2}\s(.+)', x)))

# cells without a date hold only the time, so copy them over from the "Date" column
news.loc[news['time'] == '', 'time'] = news['Date']
news
Next, we extract the date from our “Date” column.
import numpy as np

news['date'] = news['Date'].apply(lambda x: ''.join(re.findall(r'([a-zA-Z]{1,9}-\d{1,2}-\d{1,2})\s.+', x)))

# change empty cells to NaN type in the new "date" column
news.loc[news['date'] == '', 'date'] = np.nan

# forward fill, so that each time-only row inherits the date of the row above it
news.ffill(inplace=True)
news
Finally, we combine the “date” and “time” columns, convert the result to datetime type, and clean up our data frame.
# combine "date" & "time" columns and convert to datetime type
news['datetime'] = pd.to_datetime(news['date'] + ' ' + news['time'])

# clean up the dataframe
news.drop(['Date', 'time', 'date'], axis=1, inplace=True)
news.sort_values('datetime', inplace=True)
news.reset_index(drop=True, inplace=True)
news.columns = ['news_headline', 'url', 'datetime']
news
Note that some datetimes are older than the search date (2022-04-01). We will remove them at the end of our scraping (once we combine all of our scraping results) to include only relevant dates.
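The date handling described above can also be checked offline on a few sample values. The sketch below uses only the standard library instead of pandas, and the four strings are made up to mimic the two cell formats; it carries the last seen date forward to time-only cells and then keeps only the entries from the search date.

```python
from datetime import date, datetime

# Made-up sample values in the two formats of the "Date" column.
raw = ["Apr-01-22 09:30AM", "10:15AM", "Mar-31-22 04:05PM", "06:45PM"]

parsed = []
current_day = None
for cell in raw:
    parts = cell.split()
    if len(parts) == 2:      # "date time" cell: remember the date
        current_day, clock = parts
    else:                    # time-only cell: reuse the last date seen
        clock = parts[0]
    parsed.append(datetime.strptime(f"{current_day} {clock}", "%b-%d-%y %I:%M%p"))

# Keep only the entries that fall on the search date.
search_day = date(2022, 4, 1)
same_day = [ts for ts in parsed if ts.date() == search_day]
print(same_day)
```

The month abbreviation is parsed with %b, which assumes an English locale.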
NEWS API
The code below extracts news as a data frame called df_newsapi. We will also do some cleaning on the data frame.
from newsapi.newsapi_client import NewsApiClient

company_ticker = 'AMZN'
search_date = '2022-04-01'

# replace the placeholder with your own News API key
newsapi = NewsApiClient(api_key='YOUR_NEWSAPI_KEY')

# get_everything returns a dictionary
articles = newsapi.get_everything(q=company_ticker,
                                  from_param=search_date,
                                  language="en",
                                  sort_by="publishedAt",
                                  page_size=100)

# we want the value stored under the "articles" key of that dictionary
df_newsapi = pd.DataFrame(articles['articles'])
df_newsapi.head()
# do some cleaning of the df_newsapi
df_newsapi.drop(['author', 'urlToImage'], axis=1, inplace=True)
df_newsapi.rename({'publishedAt': 'datetime'}, axis=1, inplace=True)
df_newsapi.rename({'title': 'news_headline'}, axis=1, inplace=True)
# each "source" cell is a dictionary; keep only its name
df_newsapi['source'] = df_newsapi['source'].map(lambda x: x['name'])
df_newsapi.head()
GoogleNews
The Python code for news extraction via GoogleNews is given below. We used Config because the newspaper package is sometimes unable to download an article due to access restrictions on the given URL. To get around that restriction, we set the user_agent variable so that requests look like they come from a regular browser. The connection may also occasionally time out, since the package uses the Python requests module, so we set config.request_timeout as well.
We are going to limit our news extraction to the first two pages of Google News results. We could write a for loop to go through more pages, but those repetitive requests to Google would be detected automatically and stopped (returning a failed connection error).
from GoogleNews import GoogleNews
from newspaper import Config
import re

company_ticker = 'AMZN'
search_date = '2022-04-02'

# GoogleNews sometimes returns an empty dataframe, so we add a try/except block to handle those cases
try:
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = 10
    df_google = pd.DataFrame()

    # change the date string from YYYY-MM-DD to MM/DD/YYYY so that it works with GoogleNews
    start_date = re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\2/\3/\1', search_date)

    # Extract news with GoogleNews ---> gives only 10 results per request
    googlenews = GoogleNews(start=start_date)
    googlenews.search(company_ticker)

    # store the results of the first result page
    result1 = googlenews.result()
    df_google1 = pd.DataFrame(result1)

    # store the results of the 2nd result page
    googlenews.clear()
    googlenews.getpage(2)
    result2 = googlenews.result()
    df_google2 = pd.DataFrame(result2)
    df_google = pd.concat([df_google1, df_google2])

    # do some cleaning of the df_google DF
    if df_google.shape[0] != 0:
        df_google.drop(['img', 'date'], axis=1, inplace=True)
        df_google.columns = ['news_headline', 'source', 'datetime', 'description', 'url']
    display(df_google.head())
except Exception:
    pass
Combining All in one Web-Scraping Custom Function
We now write a custom function that scrapes the news with News API, FINVIZ, and GoogleNews and combines all results into a single dataframe. Our custom function (get_news) takes two string inputs: the company ticker symbol and the date for collecting news, which needs to be in YYYY-MM-DD format. We will later use user input boxes for these items in our Streamlit app.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import re
import numpy as np
from newsapi.newsapi_client import NewsApiClient
from GoogleNews import GoogleNews
from newspaper import Config

def get_news(company_ticker, search_date):
    ## newsapi
    # replace the placeholder with your own News API key
    newsapi = NewsApiClient(api_key='YOUR_NEWSAPI_KEY')
    articles = newsapi.get_everything(q=company_ticker,
                                      from_param=search_date,
                                      language="en",
                                      sort_by="publishedAt",
                                      page_size=100)
    df_newsapi = pd.DataFrame(articles['articles'])
    # do some cleaning of the DF
    df_newsapi.drop(['author', 'urlToImage'], axis=1, inplace=True)
    df_newsapi.rename({'publishedAt': 'datetime'}, axis=1, inplace=True)
    df_newsapi.rename({'title': 'news_headline'}, axis=1, inplace=True)
    df_newsapi['source'] = df_newsapi['source'].map(lambda x: x['name'])

    ## finviz
    url = ("http://finviz.com/quote.ashx?t=" + company_ticker.lower())
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    html = soup(webpage, "html.parser")
    news = pd.read_html(str(html), attrs={'class': 'fullview-news-outer'})[0]
    links = []
    for a in html.find_all('a', class_="tab-link-news"):
        links.append(a['href'])
    # Clean up news dataframe
    news.columns = ['Date', 'News_Headline']
    news['Article_Link'] = links
    # >>> clean "Date" column and create a new "datetime" column
    # extract time
    news['time'] = news['Date'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z]{1,9}-\d{1,2}-\d{1,2}\s(.+)', x)))
    news.loc[news['time'] == '', 'time'] = news['Date']
    # extract date
    news['date'] = news['Date'].apply(lambda x: ''.join(re.findall(r'([a-zA-Z]{1,9}-\d{1,2}-\d{1,2})\s.+', x)))
    news.loc[news['date'] == '', 'date'] = np.nan
    news.ffill(inplace=True)
    # convert to datetime type
    news['datetime'] = pd.to_datetime(news['date'] + ' ' + news['time'])
    news.drop(['Date', 'time', 'date'], axis=1, inplace=True)
    news.sort_values('datetime', inplace=True)
    news.reset_index(drop=True, inplace=True)
    news.columns = ['news_headline', 'url', 'datetime']
    df_finviz = news.copy()

    ## GoogleNews
    # GoogleNews sometimes fails or returns nothing, so we start from an empty
    # dataframe and wrap the scraping in a try/except block
    df_google = pd.DataFrame()
    try:
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
        config = Config()
        config.browser_user_agent = user_agent
        config.request_timeout = 10
        # change the date string from YYYY-MM-DD to MM/DD/YYYY so that it works with GoogleNews
        start_date = re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\2/\3/\1', search_date)
        # Extract news with GoogleNews ---> gives only 10 results per request
        googlenews = GoogleNews(start=start_date)
        googlenews.search(company_ticker)
        # store the results of the first result page
        result1 = googlenews.result()
        df_google1 = pd.DataFrame(result1)
        # store the results of the 2nd result page
        googlenews.clear()
        googlenews.getpage(2)
        result2 = googlenews.result()
        df_google2 = pd.DataFrame(result2)
        df_google = pd.concat([df_google1, df_google2])
        # do some cleaning of the df_google DF
        if df_google.shape[0] != 0:
            df_google.drop(['img', 'date'], axis=1, inplace=True)
            df_google.columns = ['news_headline', 'source', 'datetime', 'description', 'url']
    except Exception:
        pass

    ## Add the 3 DFs together
    df_news = pd.concat([df_newsapi, df_finviz, df_google], ignore_index=True)
    df_news['datetime'] = pd.to_datetime(df_news['datetime'], format='%Y-%m-%d %H:%M:%S')
    df_news.set_index('datetime', inplace=True)
    # only return the rows that match our search_date
    df_news = df_news[df_news.index.to_period('D') == search_date]
    df_news.sort_index(inplace=True)
    # Get a clean source column from the urls using regex
    df_news['source'] = df_news['url'].map(lambda x: ''.join(re.findall(r"https?://(?:www\.)?([A-Za-z_0-9.-]+).*", x)))
    return df_news
Here is what we get when we run our custom function for “AMZN” on “2022-04-01”:
df_news = get_news('AMZN', '2022-04-01')
df_news.shape
>>> (68, 5)
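The last step of get_news derives the source column from each article URL with a regular expression. As a point of comparison, here is a minimal sketch of the same idea using the standard library's URL parser; the function name source_from_url and the example URLs are hypothetical and not part of the tool.

```python
from urllib.parse import urlparse


def source_from_url(url: str) -> str:
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


print(source_from_url("https://www.example.com/news/amzn-story"))
print(source_from_url("http://finance.example.org/quote?t=amzn"))
```

Parsing the URL avoids having to maintain a character class for valid host names, at the cost of keeping a port number in the result if the URL contains one.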
Sentiment Analysis on News Headlines via NLTK VADER
Sentiment Analysis Methods
There are two main methods for Sentiment Analysis (SA):
1. Rules-based SA (NLTK VADER, TextBlob)
- Attaches a positive or negative rating to certain words (e.g. “horrible” has a negative association), accounts for negation where present, and returns a score based on these words. This approach is simple and extremely fast, but has some weaknesses:
- As sentences get longer, they contain more neutral words, so the overall score tends to drift towards neutral
- Sarcasm and jargon are often misinterpreted
2. Vector-based SA (Flair)
- Each word is represented as a vector in a vector space, and words with similar vectors tend to be used in similar contexts. A model trained on these representations can therefore estimate the sentiment of a given vector, and in turn of a given sentence.
- Weaknesses:
- Flair tends to be much slower than its rule-based counterparts. In exchange, it is a trained NLP model rather than a fixed set of rules, which can give better accuracy when the model is trained well.
- To put the slowdown in perspective: on 1200 sentences, NLTK took 0.78 seconds, TextBlob took 0.55 seconds, and Flair took 49 seconds (50–100x longer), which raises the question of whether the added accuracy is worth the increased runtime.
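To make the rule-based idea concrete, here is a toy scorer. It is not how VADER or TextBlob score text; the six-word lexicon, the negation list, and the sign-flip rule are all made up for illustration. It only shows the mechanism of looking words up in a rated lexicon and reacting to a preceding negation.

```python
# Toy lexicon: word -> valence. Real lexicons hold thousands of rated entries.
LEXICON = {"good": 2.0, "great": 3.0, "strong": 1.5,
           "bad": -2.0, "horrible": -3.0, "weak": -1.5}
NEGATIONS = {"not", "no", "never"}


def toy_rule_score(sentence: str) -> float:
    """Sum word valences, flipping the sign of a word that directly follows a negation."""
    score = 0.0
    negate = False
    for raw in sentence.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = LEXICON.get(word, 0.0)
        score += -valence if negate else valence
        negate = False  # the negation only affects the next word
    return score


print(toy_rule_score("Earnings were horrible"))
print(toy_rule_score("Guidance was not bad"))
```

Words missing from the lexicon contribute nothing, which is also why jargon tends to be scored poorly by this family of methods.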
The performance of each method depends on the type of text that is analyzed, and it is recommended to test them all before selecting a final SA method. You can also design your own sentiment analysis tool using supervised ML (https://python-bloggers.com/2020/10/how-to-run-sentiment-analysis-in-python-using-vader/).
For the purpose of developing our tool, NLTK VADER was used as it showed the best SA results. The VADER library returns four values:
- pos: The proportion of the text that falls in the positive category
- neu: The proportion of the text that falls in the neutral category
- neg: The proportion of the text that falls in the negative category
- compound (from -1 to 1): The normalized compound score, which is the sum of all lexicon ratings rescaled to lie between -1 and 1.
Notice that the pos, neu, and neg proportions add up to 1. Typical threshold values for the compound score are:
- positive: compound score ≥ 0.05
- neutral: compound score between -0.05 and 0.05
- negative: compound score ≤ -0.05
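The thresholds above translate directly into a small labelling function. The sketch below also shows how an unbounded sum of lexicon ratings can be squashed into the range -1 to 1. The formula and the constant alpha = 15 follow the reference VADER implementation as we understand it, so treat them as an assumption and check them against the library version you use; the function names are our own.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded sum of valences into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound: float) -> str:
    """Apply the usual +/-0.05 cut-offs to a compound score."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


for raw_sum in (-4.0, 0.1, 4.0):
    compound = normalize_compound(raw_sum)
    print(raw_sum, round(compound, 3), label_from_compound(compound))
```

Note that this labels a headline by its compound score, which can give a different result from comparing the pos and neg proportions directly.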
Obtaining VADER SA Scores on News Headlines
The python code for VADER SA is given below, which extracts the compound SA score of the news headlines in a new column of our dataframe.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

# download these 3 when run for the first time
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

def nltk_vader_score(text):
    sentiment_analyzer = SIA()
    # we take the "compound" score (from -1 to 1): the normalized sum of all lexicon ratings
    sent_score = sentiment_analyzer.polarity_scores(text)['compound']
    return sent_score

df_news['sentiment_score_vader'] = df_news['news_headline'].map(nltk_vader_score)
df_news.head()
EDA of VADER Compound Scores
In this section, we will use the Plotly library to visualize the distribution of sentiment scores and the percentage of each of the three sentiment types across all news headlines.
import plotly.express as px

fig = px.histogram(
    df_news,
    x='sentiment_score_vader',
    color='source').update_xaxes(categoryorder="total descending")

fig.update_layout(xaxis_title='Sentiment Score (Compound from -1 to 1)',
                  yaxis_title='Count',
                  font=dict(size=16),
                  bargap=0.025,
                  width=790,
                  height=520,
                  legend=dict(orientation="h",
                              yanchor="top",
                              y=1.23,
                              xanchor="center",
                              x=0.48))
fig.show('notebook')
For sentiment type, we define the following custom function that labels the sentiment scores accordingly.
def sentiment_type(text):
    analyzer = SIA().polarity_scores(text)
    neg = analyzer['neg']
    neu = analyzer['neu']
    pos = analyzer['pos']
    comp = analyzer['compound']
    # label the headline by whichever of the pos/neg proportions is larger
    if neg > pos:
        return 'negative'
    elif pos > neg:
        return 'positive'
    elif pos == neg:
        return 'neutral'

df_news['sentiment_type'] = df_news['news_headline'].map(sentiment_type)
Now we can plot a pie chart from the newly created column ‘sentiment_type’, which will show the percentage of each sentiment type for Amazon.
# compute the percentages once, so that each label stays paired with its own value
sentiment_share = df_news['sentiment_type'].value_counts(normalize=True) * 100

fig = px.pie(names=sentiment_share.index,
             values=sentiment_share.values,
             color=sentiment_share.index,
             hole=0.35,
             color_discrete_map={
                 'neutral': 'silver',
                 'positive': 'mediumspringgreen',
                 'negative': 'orangered'
             })

fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  textfont_size=22,
                  hoverinfo='label+value',
                  texttemplate="%{label}<br>%{value:.0f}%")

fig.update_layout(font=dict(size=16), width=810, height=520)
fig.show('notebook')
News Headlines WordCloud
Lastly, we generate a WordCloud map on our News Headlines to provide a global look at the news scope.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def word_cloud(text):
    stopwords = set(STOPWORDS)
    allWords = ' '.join([nws for nws in text])
    wordCloud = WordCloud(
        background_color='white',  # or 'black'
        width=1600,
        height=800,
        stopwords=stopwords,
        min_font_size=20,
        max_font_size=150).generate(allWords)
    fig, ax = plt.subplots(figsize=(20, 10), facecolor='w')  # facecolor='k' for black frame
    plt.imshow(wordCloud, interpolation='bilinear')
    ax.axis("off")
    fig.tight_layout(pad=0)
    plt.show()

print('Wordcloud for ' + company_ticker)
# this call originally read a 'news_headline_tokens' column, which is not created
# anywhere in this article, so the raw headlines are used here instead
word_cloud(df_news['news_headline'].values)
As can be seen on the WordCloud map, the topic of union workers and their voting to unionize for Amazon is a main news topic that our tool has correctly picked up.
Company Events mentioned in the News
To extract certain company events from the news, we first need to get the text of each news article using requests and BeautifulSoup. Because we want to scrape the news from various websites, the challenge is to get only the content of the news body (and not all the text within a news web link). One way is to use .body as shown below, but we still get some text that is not part of the news body. The advantage of this method is that we get clean text that does NOT need any regex post-processing.
soup = BeautifulSoup(html_text, 'lxml')
tag = soup.body
Another method is to look at several news web links individually and see which html class holds the content of the news body. Since our web scraping is dynamic (we get news from some well-known sources like Yahoo Finance or WSJ, but the news sources can be anything depending on the date and company the user selects), our class list will never be exhaustive. Another downside of this method is that we need to clean the html text using regex post-processing.
soup = BeautifulSoup(html_text, 'lxml')
body_content = soup.findAll('div',
                            attrs={
                                'class': [
                                    'caas-body',
                                    'article-content-body-only',
                                    'article__body',
                                    'body',
                                    'article-content rich-text'
                                ]
                            })
We select method 1 explained above, and create a custom function for our text extraction.
def get_article_text(Article_Link):
    import requests
    from bs4 import BeautifulSoup

    # use the requests package to make a GET request to the website, i.e. to fetch its data
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    html = requests.get(Article_Link, headers=header).content
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')

    # Get the whole body tag
    tag = soup.body
    # some pages come back without a <body>; return an empty string for those
    if tag is None:
        return ''

    # Collect the strings nested anywhere inside the body
    text = []
    for string in tag.strings:
        # keep only strings with more than 15 words
        if len(string.split()) > 15:
            text.append(string)
    return ' '.join(text)
df_news['news_text'] = df_news['url'].map(get_article_text)

# clean news_text by removing anything that is NOT a space, letter, or number
df_news['news_text'] = df_news['news_text'].apply(lambda x: re.sub('[^ a-zA-Z0-9]', '', x))
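The word-count filter can be tried offline on a small, made-up page. The sketch below swaps BeautifulSoup for the standard library's HTMLParser so that it runs without extra packages. It walks the whole document rather than only the body, and it additionally skips script and style content, which is one common kind of text that does not belong to the article. The class name LongTextCollector is hypothetical.

```python
from html.parser import HTMLParser


class LongTextCollector(HTMLParser):
    """Collect text nodes with more than `min_words` words, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self, min_words: int = 15):
        super().__init__()
        self.min_words = min_words
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and len(data.split()) > self.min_words:
            self.chunks.append(data.strip())


# Made-up page: a short menu, a long script, and one 20-word paragraph.
sample = (
    "<html><body><nav>Home | Markets | Tech</nav>"
    "<script>var tracking = 'a b c d e f g h i j k l m n o p q r s';</script>"
    "<p>" + " ".join(["word"] * 20) + "</p></body></html>"
)
collector = LongTextCollector()
collector.feed(sample)
print(len(collector.chunks))
```

Only the paragraph survives: the menu is too short, and the script is skipped even though it is long enough to pass the word-count filter.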
Now that we have the text of each news article, we can proceed to extract important company events from it, e.g. new product launches, mergers, acquisitions, stock-related events (buyback, split, …), hiring, or layoffs. We can do this in a number of ways, one of the most popular being regex. But there is a Python library called FlashText that can do the job more quickly and is much easier to work with. We therefore define a custom function that uses the FlashText library.
def keyword_extractor(text):
    from flashtext import KeywordProcessor
    kwp = KeywordProcessor()
    # event label -> phrases that signal this event
    keyword_dict = {
        'new product': ['new product', 'new products'],
        'M&A': ['merger', 'acquisition'],
        'stock split/buyback': ['buyback', 'split'],
        'workforce change': ['hire', 'hiring', 'firing', 'lay off', 'laid off']
    }
    kwp.add_keywords_from_dict(keyword_dict)
    # we use set to get rid of repeated keywords, and ', '.join() to get a string instead of a set
    return ', '.join(set(kwp.extract_keywords(text)))
We then apply our function to create a new column containing our company event keywords.
df_news['event_keywords'] = df_news['news_text'].map(keyword_extractor)
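For comparison, the regex route mentioned earlier could look like the sketch below. It reuses the same event dictionary as keyword_extractor; the names EVENT_KEYWORDS, EVENT_PATTERNS, and regex_event_labels are our own, and the output is sorted so that it is deterministic, which the set-based version does not guarantee.

```python
import re

# Same event dictionary as in keyword_extractor: label -> phrases to look for.
EVENT_KEYWORDS = {
    "new product": ["new product", "new products"],
    "M&A": ["merger", "acquisition"],
    "stock split/buyback": ["buyback", "split"],
    "workforce change": ["hire", "hiring", "firing", "lay off", "laid off"],
}

# One compiled, case-insensitive pattern per label.
# The \b anchors keep "split" from matching inside "splitting".
EVENT_PATTERNS = {
    label: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    for label, phrases in EVENT_KEYWORDS.items()
}


def regex_event_labels(text: str) -> str:
    """Return the matched event labels as a sorted, comma-separated string."""
    return ", ".join(
        sorted(label for label, pattern in EVENT_PATTERNS.items() if pattern.search(text))
    )


print(regex_event_labels("Amazon announces a stock split and a buyback while hiring slows"))
```

With only a handful of phrases the two approaches behave much the same; the difference in speed only starts to matter as the keyword list grows.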
Now, we can visualize the number of news articles containing company-event keywords for Amazon.
fig = px.histogram(
    df_news[df_news['event_keywords'] != ''],
    x='event_keywords',
    color='sentiment_type',
    color_discrete_map={
        'neutral': 'silver',
        'positive': 'mediumspringgreen',
        'negative': 'orangered'
    }).update_xaxes(categoryorder="total descending")

fig.update_layout(yaxis_title='Count',
                  xaxis_title='',
                  width=810,
                  height=620,
                  font=dict(size=16),
                  legend=dict(orientation="h",
                              yanchor="top",
                              y=1.16,
                              xanchor="center",
                              x=0.5))
fig.update_xaxes(tickangle=-45)
Disclaimer: The material in this article is purely educational and should not be taken as professional investment or any other advice. The information presented is just a snapshot.