Using NLP Libraries to Discover What the Internet Says about Omdena

An end-to-end NLP project: from scraping the internet using NLP libraries to deploying the code as a pip-installable package

Rinki Nag

May 31, 2021

11 minutes read


In this article, you will walk through an end-to-end NLP project: scraping data from Google, Twitter, and other sources using NLP libraries, building a word cloud, running sentiment analysis and a summarization algorithm, and finally deploying the code as a pip-installable package. The whole tutorial runs on one example: scraping news and posts from the internet to discover what they say about Omdena. Interesting, huh? Let's get started.

This article is part of an educational series dedicated to learning the applications of NLP.

See also: NLP Data Preparation: From Regex to Word Cloud Packages and Data Visualization.

What is Web scraping?

Web scraping is the process of extracting data from websites and other online sources. Major sources include social media platforms such as Twitter and LinkedIn. Scraping these sites directly is hard, so the best way to extract data from them is through their APIs. Twitter, for example, provides a developer API that lets you search for tweets on a topic of your choice. The collected data is then used for various data analytics purposes.

Some of the use cases of web scraping are:

For Businesses / eCommerce: Market Analysis, Price Comparison, Competition Monitoring
For Marketing: Lead Generation
For branding research to know public sentiment and views

If you want to know the rules that have to be respected when scraping the web, please have a look at the linked article.
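One common rule is to respect a site's robots.txt file. As a rough sketch, Python's standard `urllib.robotparser` can check a URL against such rules. The helper name `is_allowed` and the example rules below are made up for illustration and do not come from the original notebook or from any real site.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the rules of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration; not the real robots.txt of any site
example_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/blog/post"))     # True
print(is_allowed(example_rules, "https://example.com/private/data"))  # False
```

A robots.txt check covers only part of the picture; a site's terms of service may add further restrictions.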

Here we will try to scrape some data from social media and see what it says about Omdena.

A social media sentiment analysis tells you how people feel about your brand online. Rather than a simple count of mentions or comments, sentiment analysis considers emotions and opinions. It involves collecting and analyzing information in the posts people share about your brand on social media.

Brand sentiment (also called brand health) is determined by monitoring and analyzing brand mentions, comments, and reviews online. It is one of the components of a social listening strategy, and one of the solutions that companies most often request from data scientists.

There are many possible sources, from Facebook and Instagram to RSS feeds, LinkedIn, and Google web search, and most of them offer official APIs to fetch data.

What is the end outcome of this article?

In this article, you will learn how to fetch data from Google URL sources and Twitter, clean the data, make visualizations, run sentiment analysis and summarization algorithms on the extracted corpus, and finally deploy the code as an open-source package that can be installed with pip.

Everything from collecting the data to applying NLP techniques and deploying the result is covered.

Getting data from news articles and other google URL sources

If we want to scrape articles from Google News, there are a few parameters that we can use to build a search query.

All Google search URLs start with https://www.google.com/search?

# Making our google query ready
topic = "Omdena AI"
numResults = 3000
url = "https://www.google.com/search?q=" + topic + "&tbm=nws&hl=en&num=" + str(numResults)

q — this is the query topic, i.e., q=AI if you’re searching for Artificial Intelligence news

hl — the interface language, i.e., hl=en for English

tbm — to be matched, here we need tbm=nws to search for news items.

There’s a whole lot of other things one can match.

For instance, app for applications, blg for blogs, bks for books, isch for images, plcs for places, vid for videos, shop for shopping, and rcp for recipes.

num — controls the number of results shown. If you only want 10 results shown, num=10
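The parameters above can also be assembled with Python's standard `urllib.parse.urlencode`, which takes care of escaping spaces and special characters in the topic. This is only a sketch: the helper name `build_news_url` is made up for illustration and is not part of the original notebook.

```python
from urllib.parse import urlencode

def build_news_url(topic, num_results=10, language="en"):
    """Build a Google search URL restricted to news results."""
    params = {
        "q": topic,          # query topic
        "tbm": "nws",        # restrict the search to news items
        "hl": language,      # interface language
        "num": num_results,  # number of results requested
    }
    return "https://www.google.com/search?" + urlencode(params)

print(build_news_url("Omdena AI", num_results=10))
# https://www.google.com/search?q=Omdena+AI&tbm=nws&hl=en&num=10
```

Building the query from a dictionary keeps each parameter visible and makes it easy to switch `tbm` to another value later.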

Start scraping the articles using the URL we built above.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
results = soup.find_all("div", attrs={"class": "ZINbbc"})

This parses the HTML, collects the matching div tags, and stores them in the results variable. Class names such as ZINbbc come from Google's generated markup and may change over time.

Next, we extract the text from each result and store it in the descriptions list, since we only keep the descriptions of the news articles we scraped.

descriptions = []
for result in results:
    try:
        description = result.find("div", attrs={"class": "s3v9rd"}).get_text()
        if description != "":
            descriptions.append(description)
    except AttributeError:
        # result.find() returned None: this result has no description div
        continue
text = "".join(descriptions)

The descriptions list now holds all the sentences (i.e. the descriptions). If we print them, the output looks like this:

'Omdena is used for labeling tasks in deep learning for tree identification',
'4 months ago · Omdena brings together larger groups of AI professionals to solve social problems ',
', reducing rural family violence during a famine',
'2 months ago · We teamed up with a company called Omdena that organizes AI for good challenges',
' We recruited over 50 data scientists in order to work with'

What if we want to see the sentiment score for these extracted sentences?

Here we can use TextBlob for generic NLP tasks such as sentiment analysis, POS tagging, etc.

TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.

More about TextBlob can be found in its linked documentation.

We simply pass each extracted sentence through a TextBlob object to get its sentiment. TextBlob returns two values: a polarity score between -1 and 1, where -1 is negative, 0 is neutral, and 1 is positive, and a subjectivity score between 0 and 1, where 0 is fully objective and 1 is fully subjective.

In the loop below, we create a TextBlob object for each sentence and print the sentiment result.

This simple method is quite beginner-friendly. It can be improved with machine learning (Naive Bayes, etc.) or deep learning (LSTM, etc.) algorithms.

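from textblob import TextBlob

# g_df is assumed to be a pandas DataFrame whose 'Text' column holds the scraped sentences
for sentence in g_df['Text']:
    print(sentence)
    analysis = TextBlob(sentence)
    print(analysis.sentiment)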
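TextBlob's sentiment result is a named tuple with a polarity and a subjectivity field, and the polarity lies between -1 and 1. If a coarse label is easier to read than a raw number, the polarity can be bucketed as in the sketch below. The function name `label_polarity` and the 0.1 threshold are arbitrary choices for illustration, not something taken from the notebook.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Example scores, as they might come from analysis.sentiment.polarity
for score in (0.6, 0.0, -0.35):
    print(score, label_polarity(score))
# 0.6 positive
# 0.0 neutral
# -0.35 negative
```

The width of the neutral band is a judgment call; a wider band labels fewer sentences as positive or negative.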

Getting data from Twitter

To fetch tweets and other data from Twitter, you need a Twitter developer account. Be cautious: never share your account credentials or keys with anyone or post them on any social platform.

You can apply for a Twitter developer account at the linked page.

Approval takes 2–3 days, so make sure you answer the questions in the approval request properly.

Your keys and other information will be sent to the email address you used for the approval request (make sure it is the same email address as the one on your Twitter account).

We will use Tweepy to fetch tweets related to Omdena and append them to a list, while also checking their sentiment with TextBlob as we did in the section above.

import tweepy
from textblob import TextBlob

# Step 1 - Authenticate
consumer_key = ''
consumer_secret = ''

access_token = ''
access_token_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

# Step 2 - Retrieve tweets
# (Tweepy 4.x renamed this method to api.search_tweets)
public_tweets = api.search('Omdena')
tweets_list = []
for tweet in public_tweets:
    print(tweet.text)
    tweets_list.append(tweet.text)
    # Step 3 - Perform sentiment analysis on each tweet
    analysis = TextBlob(tweet.text)
    print(analysis.sentiment)
    print("")
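Since the keys must never be shared, one option is to keep them out of the code entirely and read them from environment variables. The sketch below shows one way to do that. The variable names and the helper `load_twitter_credentials` are hypothetical and are not used in the original notebook.

```python
import os

# Hypothetical variable names; use whatever names suit your setup
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=os.environ):
    """Return the four Twitter credentials from env, or raise if any is missing."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

With the variables set in your shell, `load_twitter_credentials()` returns a dictionary whose values can be passed to the authentication step, so the notebook can be shared without exposing any key.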

Cleaning the text and making WordCloud

I have written functions that clean the text.

What do I mean by text cleaning?

Most sentences contain URLs, punctuation marks, special characters, brackets, etc., which are not useful for our analysis.

Cleaning also removes stop words, i.e. repetitive words that carry little meaning for our analysis, such as the, day, a, also, and am. We can add our own stop words that are specific to our corpus (text) as well.

We will clean our text and use it for our word cloud visuals. If you want to learn more about NLP word clouds and text cleaning, please read our previous article on the topic (linked).

Then we join the text from the Google articles with the text from the tweets to get the final corpus for the word cloud.

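def clean_text(txt):
    # The helper functions called here are not shown in this article
    text_title = removetitle(txt)
    text_brackets = removebrackets(text_title)
    text_clean = remove_accented_chars(text_brackets)
    text_clean = text_clean.lower()
    text_clean = remove_special_chars(text_clean)
    text_clean = remove_stopwords(text_clean)
    return text_clean

# Assumption: twitter_text is the collected tweets joined into one string
twitter_text = " ".join(tweets_list)
twitter_text_clean = clean_text(twitter_text)

# Assumption: text_clean is the cleaned Google text from the scraping step
text_clean = clean_text(text)
final = text_clean + " " + twitter_text_clean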
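The cleaning helpers themselves are not listed in the article. As a rough idea of what such helpers might look like, here is a self-contained sketch built only on the standard library. All function names, the regular expressions, and the tiny stop-word list are assumptions made for illustration; the notebook's own helpers may work differently.

```python
import re
import unicodedata

# A tiny illustrative stop-word list; a real one (e.g. NLTK's) is much longer
STOP_WORDS = {"the", "a", "an", "is", "are", "am", "also", "and", "of", "to", "in", "for"}

def remove_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", " ", text)

def remove_brackets(text):
    # Drop bracketed fragments such as "[1]" or "(now)"
    return re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)

def remove_accented_chars(text):
    # Decompose accented characters and drop the combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def remove_special_chars(text):
    # Expects lower-cased input; keeps letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", " ", text)

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def simple_clean_text(text):
    text = remove_urls(text)
    text = remove_brackets(text)
    text = remove_accented_chars(text)
    text = text.lower()
    text = remove_special_chars(text)
    return remove_stopwords(text)

print(simple_clean_text("Omdena is the café for AI [1]! Visit https://omdena.com (now)"))
# omdena cafe ai visit
```

The order of the steps matters: URLs are removed before punctuation is stripped, otherwise fragments such as "https" and "com" would survive as separate words.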

After we get a cleaned final text corpus, let us plot a word cloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

wordcloud = WordCloud(stopwords=STOPWORDS).generate(final)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

We are using the wordcloud package to do so.

Here we can see that some words, like ago and ai, are repeated a lot and tell us nothing, so let us add them to the stop words list.

# Add some more stop words to the list
extra_stopwords = ['day', 'ai', 'ago', 'hour', 'months', 'omdena']
wordcloud = WordCloud(stopwords=set(list(STOPWORDS) + extra_stopwords)).generate(final)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Then we plot the word cloud again.

Now we can see more meaningful words highlighted, such as the name of Omdena's founder, Rudradeb, build portfolio, and above all data science, which fits, as Omdena builds scalable data science solutions and data science portfolios.

This gives you a quick glance at the corpus and the words that are repeated most often.
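A word cloud sizes words by how often they occur, so the same information can be printed as plain counts. The sketch below uses `collections.Counter` from the standard library. The helper name `top_words` and the sample string are invented for illustration and are not real scraped data.

```python
from collections import Counter

def top_words(cleaned_text, n=5, extra_stopwords=()):
    """Return the n most frequent words in an already cleaned, space-separated text."""
    ignored = set(extra_stopwords)
    counts = Counter(word for word in cleaned_text.split() if word not in ignored)
    return counts.most_common(n)

# Invented sample text, standing in for the cleaned corpus
sample = "data science omdena ai data science portfolio ago ai data"
print(top_words(sample, n=2, extra_stopwords={"ai", "ago", "omdena"}))
# [('data', 3), ('science', 2)]
```

Printing the counts is a quick way to decide which extra stop words to add before drawing the cloud again.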

After scraping the data, cleaning it, performing sentiment analysis on it, and making word clouds, let us try to get a summary of it. A summary is very helpful when we have lots of data and want a glance at the whole corpus in a few sentences.

Running a summarization algorithm for generating an overview of the whole corpus of data

Wikipedia defines automatic summarization as the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. In addition to text, images and videos can also be summarized.

This technique is mostly used in news apps, where the summary gives you a concise glance at each article.

Many algorithms can be used to summarize a large text corpus, such as TextRank, sentence scoring, and techniques based on NLTK and Gensim.

For that, we will clean the data, combine it into a single corpus, and try different techniques on it.

Here we will try a few of them:

1. Gensim summarizer

Parameters

text (str) — Given text.
ratio (float, optional) — Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
word_count (int or None, optional) — Determines how many words will the output contain. If both parameters are provided, the ratio will be ignored.
split (bool, optional) — If True, a list of sentences will be returned. Otherwise joined strings will be returned.

We simply pass the data to the summarizer, experiment with the above parameters, and keep the best output. Note that the gensim.summarization module was removed in Gensim 4.0, so this code needs an earlier Gensim release. Here are two calls where I tried different parameter settings.

from gensim.summarization import summarize  # requires gensim < 4.0
print(summarize(DOCUMENT, ratio=0.2, split=False))

And

print(summarize(DOCUMENT, word_count=75, split=False))

2. NLTK technique

We will try a simple NLTK-based technique for summarization.

Here we clean the text, tokenize the corpus, remove stop words, spaces, etc., vectorize the text, and take the top n sentences.

Why vectorize, you may ask? Text vectorization converts text into a numerical representation. Once normalized, these numbers let us rank the sentences by importance, so we can select the top ones.

import re

import nltk
import numpy as np

# NLTK's stop-word and tokenizer data must be downloaded first with nltk.download()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# Assumption: sentences is the corpus (DOCUMENT) split into sentences
sentences = nltk.sent_tokenize(DOCUMENT)
normalize_corpus = np.vectorize(normalize_document)
norm_sentences = normalize_corpus(sentences)
norm_sentences[:3]
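The listing above stops after normalization; the scoring and selection steps are not shown in the article. To illustrate the idea of ranking sentences and keeping the top n, here is a minimal sketch that scores each sentence by the average frequency of its words. This is a plain word-frequency method swapped in for simplicity, not the vectorizer used in the notebook, and the function name `summarize_by_frequency` is hypothetical.

```python
import re
from collections import Counter

def summarize_by_frequency(text, top_n=2, stop_words=frozenset()):
    """Pick the top_n sentences with the highest average word frequency."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in stop_words]
        for s in sentences
    ]
    # Frequency of every remaining word across the whole text
    freq = Counter(word for words in tokenized for word in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Take the best-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    best = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in best)

# Invented sample text for illustration
sample = ("Omdena builds AI solutions. Omdena AI challenges bring people together. "
          "The weather was nice.")
print(summarize_by_frequency(sample, top_n=1, stop_words={"the", "was"}))
# Omdena builds AI solutions.
```

Averaging by sentence length keeps long sentences from winning only because they contain more words.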

Conclusion

In this case, the Gensim method seemed to work best and gave the better summary.

Please find the code file (Jupyter Notebook) at the linked page.

Deploying this code as a package to save efforts

Deployment is the essential final step that lets data scientists make a model, or any AI work, accessible to everyone.

First of all, we have to clean the code and organize it into proper functions so that it is ready to be deployed as a Python package. Yes, we will publish this code on PyPI so that you can share it with your friends, and they can install it with pip to perform the basic NLP tasks we are doing in this blog.

1. Create an account at https://pypi.org/ and remember your ID and password, as they are required whenever you want to publish or update your package.

2. I have cleaned the code. For packaging, we have to split all the code into small functions and put them in a file. I have made a file named main.py and put all the code in it. We also need an __init__.py, where we import from the main file; if we had more code files, they would be imported there too.

Please look at the file structure and code at https://github.com/eaglewarrior/scrape_do_nlp

Once your code is ready, make sure the file structure is the same as the one I have made on GitHub.

3. Test the package locally before you upload it. Have a look at my demo.ipynb (linked), where I did the same.

4. Build setup.py. This is an important step, and it requires just a few pieces of information, such as install_requires, where you list the external dependencies needed to run the package. After you finish writing setup.py, run this command:

python setup.py sdist bdist_wheel

This will create three folders: build, dist, and an egg-info folder. The files in dist are the ones that get uploaded to the PyPI website. The setup.py for this package looks like this:

from setuptools import setup, find_packages

VERSION = '1.0.0'
DESCRIPTION = 'This package will just take Twitter keys and topic you want to scrape and give summary and sentiment as output'
LONG_DESCRIPTION = (
    'This package will scrape google and Twitter and if sentiment flag is on it will do '
    'sentiment analysis and give summarization as output, the package is modular enough and '
    'separate task can be done like only scraping only google text or Twitter text, etc'
)

# Setting up
setup(
    # the name must match the package folder name
    name="scrape_do_nlp",
    version=VERSION,
    author="Rinki Nag",
    author_email="",
    url="https://github.com/eaglewarrior/scrape_do_nlp",
    description=DESCRIPTION,
    long_description=LONG_DESCRIPTION,
    packages=find_packages(),
    # External packages that need to be installed along with your package.
    # Standard-library modules (urllib, time, re, unicodedata) are part of
    # Python itself and do not belong in this list.
    install_requires=["requests", "spacy", "bs4", "wordcloud", "matplotlib", "nltk",
                      "tweepy", "textblob", "pandas", "numpy", "gensim"],
    keywords=['python', 'first package'],
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Education",
        "Programming Language :: Python :: 2",
        "Programming Language :: Python :: 3",
        "Operating System :: MacOS :: MacOS X",
        "Operating System :: Microsoft :: Windows",
    ],
)

The setup.py can be found at the linked page.

5. The final step is to upload the package to PyPI.

Run this command; you will be asked for the username and password you used to sign up earlier:

python -m twine upload dist/*

And you are done. The command prints a URL where you can find your package.

Visit the URL and you will see the pip command to install the package:

pip install scrape-do-nlp

The final step is to install and check if there are any issues in the package.
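One quick check after installing is to ask Python which version of the distribution it sees. The sketch below uses `importlib.metadata` from the standard library (Python 3.8 and later). The helper name `installed_version` is made up for illustration.

```python
from importlib import metadata

def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None

# Prints the version string if the package is installed, otherwise None
print(installed_version("scrape-do-nlp"))
```

If this prints None right after a pip install, the package was probably installed into a different Python environment than the one running the check.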

As an additional step before pushing the package to the main PyPI server, you can first try the upload on TestPyPI, the test server of PyPI.

Future Improvements

We can add data collection from RSS feeds, Instagram, etc.
A better summarization technique, such as TextRank (modeled on Google's PageRank), may give better results
A better sentiment analysis technique, such as an LSTM or another learned model, could be used to get better results

That's all. You have now learned the basics of NLP and Python package building, so you can make your own NLP package and share it with your friends.
