Analytics Snippet: Impersonating Scott Morrison and Bill Shorten
While the 2019 Australian Federal Election is a very serious matter, the young Actuaries Data Analytics Group (yDAWG) decided to have a bit of fun. Using a Markov chain generator and an existing library of historical tweets, they explain how synthetic tweets can impersonate our political leaders.
For background on this data analytics adventure into the twittersphere, see this article where the yDAWG authors outline their investigation into what the 2019 candidates, parties and the public are saying about the election.
Let’s get started!
First, let’s load the packages. Note that markovify is the key package used in our exercise.
From their GitHub we can see that Markovify is a simple, extensible Markov chain generator. Right now, its primary use is for building Markov models of large corpora of text and generating random sentences from that. However, in theory, it could be used for other applications.
As a reminder, to install this package you just need to use the following command:
pip install markovify
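To give a feel for what a Markov chain text generator does, here is a minimal pure-Python sketch. It is not markovify's actual implementation; the function names, the sentinel tokens and the second toy sentence are our own inventions for illustration. Each state is a pair of consecutive words, and the chain records which words were seen directly after that pair.

```python
import random
from collections import defaultdict

# Sentinel tokens marking sentence boundaries (names chosen for this sketch).
BEGIN, END = "___BEGIN__", "___END__"


def build_chain(sentences, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def generate(chain, state_size=2, rng=random):
    """Walk the chain from the BEGIN state until END is drawn."""
    state = (BEGIN,) * state_size
    output = []
    while True:
        next_word = rng.choice(chain[state])
        if next_word == END:
            return " ".join(output)
        output.append(next_word)
        state = state[1:] + (next_word,)


# The first sentence is a real tweet quoted in this article;
# the second is invented purely to create a shared word pair ("more of").
toy_corpus = [
    "We want you to keep more of what you earn",
    "I will be keeping more of our promises",
]
chain = build_chain(toy_corpus)
for _ in range(4):
    print(generate(chain))
```

Because both toy sentences pass through the state ("more", "of"), the walk can switch from one sentence to the other at that point, which is how a hybrid sentence can appear that neither author wrote.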
import warnings
import pandas as pd
import markovify as mk
pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore", category=DeprecationWarning)
sm = pd.read_excel(r'ScottMorrisonMP20190512-0710.xlsx')
bs = pd.read_excel('billshortenmp20190512-0716.xlsx')
sm.shape, bs.shape
Let’s have a quick glance at the dataset.
sm.head()
bs.head()
We will now add a candidate column to each dataframe and concatenate the two.
sm['candidate'] = 'Scott Morrison'
bs['candidate'] = 'Bill Shorten'
df = pd.concat([sm, bs])
df.head()
Next, we confirm that the number of tweets for each candidate is still about 3,200, so we haven’t lost any data in the concatenation.
df['candidate'].value_counts()
Now, we define a function which does the following tasks:
- For a given candidate, read in all tweet texts, and store all texts into a single list
- Make a text model using the list
- Make 8 short tweets (which are of course fake)
def tweet(tweeter):
    doc = df[df['candidate'] == tweeter].text.tolist()
    text_model = mk.Text(doc)
    print('\n', tweeter)
    for i in range(8):
        print(text_model.make_short_sentence(140))
tweet('Scott Morrison')
tweet('Bill Shorten')
I will be keeping more of what you earn (by fake Scott Morrison)
The first fake tweet we made for Scott Morrison is really funny! I think it comes from this actual tweet:
We want you to keep more of what you earn (by real Scott Morrison)
This tweet is in row 206 of our dataset.
sm['text'][206]
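The article does not say how that row was located. One possible way to trace a fake tweet back to its likely source is to search the real tweets for a phrase they share. The helper below is a hypothetical sketch using a plain list of strings (for example, what `sm['text'].tolist()` would return); the second toy tweet is invented for illustration.

```python
def find_source_tweets(tweets, phrase):
    """Return (index, text) pairs for tweets containing `phrase`, ignoring case."""
    needle = phrase.lower()
    return [
        (i, text)
        for i, text in enumerate(tweets)
        # Skip non-string entries such as missing values.
        if isinstance(text, str) and needle in text.lower()
    ]


# Toy stand-in for the real tweet list; only the first entry is a real tweet.
toy_tweets = [
    "We want you to keep more of what you earn",
    "Great to be in Brisbane today",
]
print(find_source_tweets(toy_tweets, "more of what you earn"))
# → [(0, 'We want you to keep more of what you earn')]
```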
Let’s try it again!
tweet('Scott Morrison')
tweet('Bill Shorten')
Labor will introduce a 15% GST. https://t.co/KmiyMiF9jm (by fake Bill Shorten)
If you click the link, you will see that the real Bill Shorten said:
There is nothing fair about increasing the GST, @TurnbullMalcolm . @AustralianLabor stands against a 15% GST.
Another total twist of words! The scary thing is how realistic the fake tweets look. To someone who doesn’t follow politics, the phrasing and content are pretty believable.
Let’s now force the fake tweets to start from certain words.
Let’s start with the word Australia.
def subj_tweet(tweeter, subject):
    doc = df[df['candidate'] == tweeter].text.tolist()
    text_model = mk.Text(doc)
    print('\n', tweeter)
    for i in range(8):
        print(text_model.make_sentence_with_start(subject, strict=False))
subj_tweet('Scott Morrison', 'Australia')
subj_tweet('Bill Shorten', 'Australia')
Next, try Climate
subj_tweet('Scott Morrison', 'Climate')
subj_tweet('Bill Shorten', 'Climate')
Budget
subj_tweet('Scott Morrison', 'Budget')
subj_tweet('Bill Shorten', 'Budget')
‘Budget surplus in 2021’ came up a couple of times. Scott Morrison seems really confident about this. OK, OK, I got it.
Taxes
subj_tweet('Scott Morrison', 'Taxes')
subj_tweet('Bill Shorten', 'Taxes')
Bill Shorten has nothing to comment on taxes…
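We did not dig into why nothing came back here. One plausible explanation, offered as an assumption rather than a confirmed finding, is that the generator can only start from a word it has actually seen, with the same spelling and capitalisation. A quick way to check is to count how often the start word appears as a token. The helper name and the toy tweets below are invented for illustration.

```python
def count_start_candidates(tweets, word):
    """Count tweets containing `word` as a whitespace-separated token (case-sensitive)."""
    return sum(
        1
        for text in tweets
        if isinstance(text, str) and word in text.split()
    )


# Invented toy tweets, not taken from the dataset.
toy_tweets = [
    "Taxes will be lower under our plan",
    "Lower taxes for workers",
]
print(count_start_candidates(toy_tweets, "Taxes"))       # capitalised form
print(count_start_candidates(toy_tweets[1:], "Taxes"))   # only lowercase present
```

If the count for a candidate's real tweets is zero, trying a different capitalisation of the start word (for example 'taxes') may be worth a go.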
I
subj_tweet('Scott Morrison', 'I')
subj_tweet('Bill Shorten', 'I')
I seek to lead, ready to govern. (by fake Bill Shorten)
Very poetic!
We
subj_tweet('Scott Morrison', 'We')
subj_tweet('Bill Shorten', 'We')
Great
subj_tweet('Scott Morrison', 'Great')
subj_tweet('Bill Shorten', 'Great')
Conclusion
This exercise has proven to be a lot of fun. There are several things you can do to further improve the results:
- remove special characters
- use spaCy’s part-of-speech (POS) tagger to make better sounding sentences
- get more data
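As a starting point for the first suggestion, here is a minimal sketch of a cleaning step that could be applied to each tweet before building the text model. The function name and the choice of which characters to keep are our own assumptions; adjust them to taste.

```python
import re

# Matches http/https links such as the t.co short links found in tweets.
URL_RE = re.compile(r"https?://\S+")
# Keep word characters, whitespace and some basic punctuation; drop everything else.
SPECIAL_RE = re.compile(r"[^\w\s.,!?'@#%$-]")


def clean_tweet(text):
    """Strip links and unusual characters, then collapse repeated whitespace."""
    text = URL_RE.sub("", text)
    text = SPECIAL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("Labor will introduce a 15% GST. https://t.co/KmiyMiF9jm"))
# → Labor will introduce a 15% GST.
```

With the dataframe from this article, the cleaned list could be produced with something like `[clean_tweet(t) for t in doc if isinstance(t, str)]` before passing it to the text model.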
To me, it’s very scary to see how easy it is to create a fake Twitter bot. In the age of data analytics, all it takes is the skill to pull down a few thousand tweets and half an hour of coding time, thanks to the great Python open source community.
In fact, fake tweets were already used in the 2016 US election and had a great impact, according to ANU’s research. Something to think about next time you read stuff from the internet!