Facebook page news auto-poster

So you have learned a bit about machine learning and web crawling. Now you are excited to build something simple yet still useful. What can you make? Here is one idea.

Let’s say you manage a Facebook Page for your business or community and want to regularly post relevant news or articles from the internet. Doing it manually is boring, but since you have just learned machine learning and web crawling, you can build a Facebook Page news auto-poster instead.

Process flow diagram

As you may notice from the diagram, Python is chosen for this project. The main reason is simply that there are some amazing open-source libraries that simplify the work:

  1. requests and beautifulsoup for web crawling
  2. sklearn and sastrawi for text classification

The development is separated into 2 phases:

  1. Preparation: create news crawlers to gather data, then use that data to train a text classification model using machine learning and NLP (Natural Language Processing) techniques
  2. Execution: create two executables/jobs:
    1. crawler.py: crawls the latest news, filters/classifies the relevant ones, and persists them in a database
    2. fb.py: reads from the database and posts the news links to a Facebook Page

Preparation

News Crawler

First, the sources of the news articles need to be decided. In this case, three news portals (detik.com, liputan6.com and kompas.com) are chosen. Then, a crawler is created for each of them. To make it easier to extend, the abstract class below is used:

from abc import ABCMeta, abstractmethod

# Every news portal crawler extends this class and implements crawl()
class BaseCrawler(metaclass=ABCMeta):

    @abstractmethod
    def crawl(self, silent=False):
        # Each subclass fetches the latest articles from its portal
        raise NotImplementedError
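
As an illustration of how a concrete crawler might extend this class, here is a minimal sketch using requests and BeautifulSoup. The index URL and CSS selector below are placeholders, not the actual ones used for detik.com, liputan6.com or kompas.com, and the real crawlers may return richer data.

import requests
from bs4 import BeautifulSoup

class DetikCrawler(BaseCrawler):
    # Placeholder values; the real index URL and selector depend on the portal's markup
    INDEX_URL = "https://news.detik.com/indeks"
    TITLE_SELECTOR = "h3 a"

    def crawl(self, silent=False):
        # Fetch the index page and pull out (title, link) pairs
        response = requests.get(self.INDEX_URL, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        articles = [(a.get_text(strip=True), a.get("href"))
                    for a in soup.select(self.TITLE_SELECTOR)]
        if not silent:
            print("Crawled {} articles".format(len(articles)))
        return articles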

In this project, the article’s title is considered sufficient to determine whether an article is relevant or not. After all, many people nowadays only read the news title and immediately jump to conclusions regardless of the content, right?

Text Classification

By running the crawlers for several hours, a few hundred news titles should be collected. Next, each news title should be manually labeled as either 1 (relevant) or 0 (irrelevant) and saved as a CSV file. This will be the dataset used to train a text classification model.
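
Loading the labeled dataset for training is then straightforward, for example with pandas (the file name and column names here are assumptions, not necessarily those of the original CSV):

import pandas as pd

# Hypothetical layout: one column for the title, one for the manual 1/0 label
df = pd.read_csv("dataset.csv")
X, y = df["title"], df["label"]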

There are many techniques available to train a text classification model. In this project, I chose classic ones, mainly due to the limited computational resources of the production server.

The complete code, train.py, also includes another technique that utilizes Indonesian pretrained Word2Vec embeddings (by Facebook) to determine word weights during TF-IDF vectorization (I got this code from another awesome blog post). When this vectorization technique is combined with an RBF SVM classifier, it actually yields a better result (~9% better) than the previous technique. But since I haven’t optimized the code for low-RAM machines, I chose to use the previous technique.
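
As a rough sketch of the “classic” approach, assuming plain TF-IDF features with an Indonesian stemmer from Sastrawi and a lightweight linear classifier (train.py may use different components and parameters):

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Sastrawi stemmer for Indonesian text, applied to each title before TF-IDF weighting
stemmer = StemmerFactory().create_stemmer()

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=stemmer.stem)),
    ("clf", LinearSVC()),  # lightweight classifier; the actual choice in train.py may differ
])
pipeline.fit(X, y)  # X, y: the titles and labels loaded from the CSV above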

The output of the training is a trained model saved as a pickle file. By saving the trained model to a file, we don’t have to retrain the model every time we want to classify a new input (news title).
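
Saving and reloading the trained pipeline is then a few lines with pickle (the file name is arbitrary here):

import pickle

# Persist the trained pipeline so it can be reused without retraining
with open("model.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Later, e.g. in crawler.py, load it back and classify a new title
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
print(model.predict(["some news title"]))  # -> array([0]) or array([1])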

Execution

Auto-Crawler-Classifier

crawler.py is set to run periodically using cron to “produce” news titles (reusing the crawlers and the trained text classification model) and store them in an sqlite3 database. Basically, this script mostly reuses what we have done in the preparation phase.
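
A stripped-down sketch of what crawler.py might look like, assuming the hypothetical crawler class and pickled model from earlier and an illustrative table schema:

import pickle
import sqlite3

# Load the trained classifier produced in the preparation phase
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

conn = sqlite3.connect("news.db")
conn.execute("CREATE TABLE IF NOT EXISTS news "
             "(title TEXT UNIQUE, url TEXT, posted INTEGER DEFAULT 0)")

for crawler in [DetikCrawler()]:  # plus the liputan6.com and kompas.com crawlers
    for title, url in crawler.crawl(silent=True):
        if model.predict([title])[0] == 1:  # keep only titles classified as relevant
            conn.execute("INSERT OR IGNORE INTO news (title, url) VALUES (?, ?)",
                         (title, url))

conn.commit()
conn.close()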

Auto-Poster

Finally, fb.py is also set to run periodically to “consume” the stored news titles by publishing them to the Facebook Page using the Graph API. Obviously crawler.py must run more often than fb.py to make sure there is a steady supply of news titles (and links).
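
For posting, the Graph API exposes a /feed edge on the Page. A minimal sketch of fb.py (the page ID, access token and table layout are placeholders) could look like this:

import sqlite3
import requests

PAGE_ID = "YOUR_PAGE_ID"          # placeholder
ACCESS_TOKEN = "YOUR_PAGE_TOKEN"  # a Page access token with publishing permission

conn = sqlite3.connect("news.db")
row = conn.execute("SELECT rowid, title, url FROM news "
                   "WHERE posted = 0 ORDER BY rowid LIMIT 1").fetchone()

if row:
    rowid, title, url = row
    # Publish the link to the Page feed via the Graph API
    resp = requests.post(
        "https://graph.facebook.com/{}/feed".format(PAGE_ID),
        data={"message": title, "link": url, "access_token": ACCESS_TOKEN},
    )
    if resp.ok:
        # Mark the row as posted so it is not published again
        conn.execute("UPDATE news SET posted = 1 WHERE rowid = ?", (rowid,))
        conn.commit()

conn.close()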

Conclusion

There you have it: a “simple” yet (hopefully) useful machine learning project. It should at least provide some experience with web crawling, text classification and Facebook API integration.

At the time of writing, an instance of this project is still posting links to the “Pantau Harga” Facebook Page (the content is mainly in Indonesian). It personally helps me keep the page “alive” with almost no effort (I still need to delete invalid/false-positive posts manually, though). So, I hope you find it useful as well. Cheers!
