Data gathering, assessing and cleaning stages are documented in wrangle_report.html.
Parts of the data analysis and visualisations are presented in a more "reader-friendly" way in act_report.html.
# imports
import os
import time
import requests
import pandas as pd
import tweepy
import json
import numpy as np
I downloaded twitter_archive_enhanced.csv, uploaded it to the Project Workspace on Udacity, and read it into the twitter_archive dataframe with pandas.
# loading twitter archive data
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head(1)
I wrote code to download image_predictions.tsv directly to the Project Workspace and read it into the image_predictions dataframe with pandas.
# getting image prediction data file
image_prediction_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(image_prediction_url)
with open("image_predictions.tsv", mode = 'wb') as file:
    file.write(r.content)
# loading image prediction data
image_predictions = pd.read_csv('image_predictions.tsv', sep = '\t')
image_predictions.head(1)
I wrote a script to get Twitter JSON data via the API with the Tweepy library, using the list of tweet IDs from the twitter_archive dataframe, and saved the results to the file tweet_json.txt. I uploaded this file to the Project Workspace and added the script's code to the project notebook without the authentication keys. Since the cell would raise errors if left runnable, I commented it out. I then read the data from tweet_json.txt into the tweet_jsons dataframe, using the json and pandas libraries.
# Twitter data gathering script. Uncomment and add your keys to run.
#tokens = {"consumer_key": "",
# "consumer_secret": "",
# "oauth_token": "",
# "oauth_token_secret": ""}
#
#consumer_key = tokens["consumer_key"]
#consumer_secret = tokens["consumer_secret"]
#oauth_token = tokens["oauth_token"]
#oauth_token_secret = tokens["oauth_token_secret"]
#
#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(oauth_token, oauth_token_secret)
#api = tweepy.API(auth, wait_on_rate_limit = True)
#
#filename = 'tweet_json.txt'
#
#try:
# os.remove(filename)
#except OSError:
# pass
#
#tweet_errors = {}
#count = 0
#
#with open(filename, 'a') as f:
# for tweet_id in twitter_archive['tweet_id']:
# try:
# tweet = api.get_status(tweet_id, tweet_mode='extended')
# json.dump(tweet._json, f)
# f.write('\n')
# count += 1
# except tweepy.TweepError as e:
# print(tweet_id, e.args[0][0]['message'])
# tweet_errors[tweet_id] = e.reason
# time.sleep(1.2)
# if count % 100 == 0:
# print(count)
#
#print("Errors:", tweet_errors)
#print("Count:", str(count))
# script output: count
print("Count: 2340")
# script output: errors
errors = {888202515573088257: "[{'code': 144, 'message': 'No status found with that ID.'}]",
873697596434513921: "[{'code': 144, 'message': 'No status found with that ID.'}]",
872668790621863937: "[{'code': 144, 'message': 'No status found with that ID.'}]",
869988702071779329: "[{'code': 144, 'message': 'No status found with that ID.'}]",
866816280283807744: "[{'code': 144, 'message': 'No status found with that ID.'}]",
861769973181624320: "[{'code': 144, 'message': 'No status found with that ID.'}]",
845459076796616705: "[{'code': 144, 'message': 'No status found with that ID.'}]",
842892208864923648: "[{'code': 144, 'message': 'No status found with that ID.'}]",
837012587749474308: "[{'code': 144, 'message': 'No status found with that ID.'}]",
827228250799742977: "[{'code': 144, 'message': 'No status found with that ID.'}]",
812747805718642688: "[{'code': 144, 'message': 'No status found with that ID.'}]",
802247111496568832: "[{'code': 144, 'message': 'No status found with that ID.'}]",
775096608509886464: "[{'code': 144, 'message': 'No status found with that ID.'}]",
770743923962707968: "[{'code': 144, 'message': 'No status found with that ID.'}]",
754011816964026368: "[{'code': 144, 'message': 'No status found with that ID.'}]",
680055455951884288: "[{'code': 144, 'message': 'No status found with that ID.'}]"}
len(list(errors.keys()))
There are 16 tweets in the original Twitter archive data that are now missing online. For the other 2340 tweets, the additional information on likes and retweets was gathered successfully.
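The time.sleep(1.2) in the gathering script above keeps the loop under Twitter's rate limit. A rough sketch of the math, assuming the standard v1.1 limit of 900 requests per 15-minute window for GET statuses/show (check your app's actual limits):

```python
# Back-of-the-envelope rate-limit math behind the 1.2 s sleep above.
# Assumption: 900 requests per 15-minute window (standard v1.1 limit
# for GET statuses/show with user auth).
WINDOW_SECONDS = 15 * 60           # 900 seconds per window
REQUESTS_PER_WINDOW = 900
min_delay = WINDOW_SECONDS / REQUESTS_PER_WINDOW   # 1.0 s between calls
safe_delay = min_delay * 1.2                       # 20% safety margin
print(min_delay, safe_delay)
```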
# reading JSON data from the text file
json_list = []
with open('tweet_json.txt') as f:
    for line in f:
        a_json = json.loads(line)
        json_list.append({'tweet_id': a_json['id'],
                          'favorite_count': a_json['favorite_count'],
                          'retweet_count': a_json['retweet_count']})
tweet_jsons = pd.DataFrame(json_list)
tweet_jsons.head()
# rearranging columns
tweet_jsons = tweet_jsons[['tweet_id', 'favorite_count', 'retweet_count']]
tweet_jsons.head()
twitter_archive.shape
twitter_archive.info()
There are 17 variables in the twitter_archive dataframe; the first 10 come from the original Twitter data, and 7 were added later, based mostly on the content of the tweets.
For the timestamp columns, we can see the wrong data types above. There are also non-null values in the columns indicating retweets and replies. Retweets should be excluded per the project guidelines; replies need to be assessed further.
Since it is impossible for a dog to be in all stages simultaneously, we can assume that in the dog stage columns the negative/missing options are encoded with strings, not NaN.
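A quick way to confirm that reading: a toy column (standing in for doggo/floofer/pupper/puppo) shows pandas treats these "None" entries as ordinary strings, not missing values:

```python
import pandas as pd

# Toy stand-in for one of the dog stage columns.
col = pd.Series(['None', 'doggo', 'None'])
none_strings = (col == 'None').sum()   # matches the literal string "None"
true_missing = col.isnull().sum()      # pandas sees no actual NaN here
print(none_strings, true_missing)
```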
twitter_archive.head()
twitter_archive.tail()
twitter_archive.sample(10)
twitter_archive.source.value_counts()
The source variable can be converted to the category type, since it has a limited number of values. However, the HTML markup should be stripped for readability.
pd.set_option('display.max_colwidth', -1)
twitter_archive[twitter_archive.expanded_urls.notnull()][['tweet_id', 'expanded_urls']].sample(10)
As can be seen from the table above, some tweets have duplicated URLs in the expanded_urls column, which may come from the entities and extended_entities JSON fields of the original archive data.
twitter_archive.rating_denominator.value_counts()
twitter_archive.rating_numerator.value_counts().sort_index()
Though in general a rating is expected to be in M/N format, where N is 10 and M is below or slightly above 10, there are numbers in these two columns that don't fit. They will require further investigation during cleaning. These two columns should also be combined into a single rating column by calculation, to be used in further analysis.
pd.set_option('display.max_colwidth', -1)
twitter_archive[twitter_archive.in_reply_to_status_id.notnull()][["rating_numerator", "rating_denominator", "text"]]
As for replies, they sometimes lack images, sometimes contain additional information as comments on an original @dog_rates tweet, and sometimes are not about dogs at all. For consistency of information, it may be useful to exclude replies together with retweets.
twitter_archive.doggo.value_counts()
twitter_archive.pupper.value_counts()
twitter_archive.puppo.value_counts()
twitter_archive.floofer.value_counts()
As can be seen from the output above, the missing values are encoded with "None" in string format. The pupper, puppo and doggo columns may also be combined into one dog_stages column and used as an ordinal categorical variable with three levels.
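A minimal sketch of such an ordinal variable with pd.CategoricalDtype; the level order (pupper < puppo < doggo, roughly by age) is my assumption, not something fixed by the data:

```python
import pandas as pd

# Hypothetical dog_stages column built as an ordered categorical.
stage_type = pd.CategoricalDtype(categories=['pupper', 'puppo', 'doggo'],
                                 ordered=True)
dog_stages = pd.Series(['doggo', 'pupper', 'puppo']).astype(stage_type)
print(dog_stages.cat.ordered, dog_stages.min())
```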
twitter_archive[twitter_archive.name.notnull()].apply(lambda x: x['name']
if x['name'][0].islower() else "Names",
axis = 1).value_counts()
The name column contains many non-name words extracted from the text, which should be excluded. Still, some of those tweets may contain names in the text, just not where they were expected. This will require further investigation during data cleaning.
Some rows have non-null values in retweeted_status_id and retweeted_status_user_id, which means that these tweets are actually retweets, and this doesn't follow the project guidelines.
The floofer and name columns have missing values encoded with "None" strings, not pandas NaN values.
The timestamp and retweeted_status_timestamp columns are not in datetime format.
The in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns are in float format and scientific notation; but if retweets and replies are removed, these columns, together with retweeted_status_timestamp, will contain only null values and can be dropped.
The name column contains articles and other "non-name" words.
The source column values are links with HTML wrapped around the actual content, which doesn't improve readability. Also, the type of the column should be category.
There are duplicated URLs in the expanded_urls column.
The pupper, puppo and doggo columns may be combined into one as the levels of one categorical variable. Still, dual values may occur when there are many dogs in a picture.
The rating_numerator and rating_denominator columns should be used to calculate one rating value in float format to be used in analysis.
image_predictions.shape
image_predictions.info()
image_predictions.head()
image_predictions.tail()
image_predictions.sample(10)
twitter_archive.shape[0] - image_predictions.shape[0]
In some cases it may be reasonable to combine the predictions into three columns:
Number Of Prediction | Dog Breed | Confidence
but for the purposes of this project, where such a change would produce many rows for the same tweet, it seems unreasonable.
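For reference, the three-column long format described above could be sketched like this, on a toy row with hypothetical column names mirroring image_predictions:

```python
import pandas as pd

# Toy wide row mimicking image_predictions (p1/p1_conf, p2/p2_conf, ...).
wide = pd.DataFrame({'tweet_id': [1],
                     'p1': ['labrador_retriever'], 'p1_conf': [0.82],
                     'p2': ['pug'], 'p2_conf': [0.07]})

# One output row per prediction: Number Of Prediction | Dog Breed | Confidence.
frames = [pd.DataFrame({'tweet_id': wide['tweet_id'],
                        'prediction_no': n,
                        'dog_breed': wide[f'p{n}'],
                        'confidence': wide[f'p{n}_conf']})
          for n in (1, 2)]
long_preds = pd.concat(frames, ignore_index=True)
print(len(long_preds))
```

This is why the reshape multiplies rows: each tweet would appear once per prediction.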
tweet_jsons.info()
tweet_jsons.head(20)
twitter_archive.shape[0] - tweet_jsons.shape[0]
The following steps need to be taken to clean and combine the data for further analysis.
Remove the rows in the twitter_archive dataframe that correspond to retweets and replies.
Drop the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns.
Convert the timestamp column to datetime format.
Strip the HTML from the source column and convert it to category type.
Remove the duplicated URLs in the expanded_urls column.
Replace "None" values in the dog stage columns and name with pandas NaN values.
Check if any names can be extracted from tweets with non-name words in the name column and add the proper names, if any.
Replace other non-name words in the name column with NaN values.
Combine the pupper, puppo and doggo columns in one dog_stages column.
Check dog_stages for correctness.
Combine the cleaned rating_numerator and rating_denominator columns in one rating column in float format.
Join the twitter_archive dataframe with the image_predictions and tweet_jsons dataframes on the tweet_id/id columns, removing the rows whose tweet IDs are not present in all three dataframes.
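The join step relies on pandas' default inner merge dropping IDs missing from either side; a toy sketch:

```python
import pandas as pd

# Toy frames standing in for the real ones: ID 1 has no image
# prediction and ID 4 has no archive row, so both disappear.
archive = pd.DataFrame({'tweet_id': [1, 2, 3]})
preds = pd.DataFrame({'tweet_id': [2, 3, 4], 'p1_conf': [0.9, 0.8, 0.7]})
merged = archive.merge(preds, on='tweet_id')   # how='inner' is the default
print(list(merged['tweet_id']))
```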
# copying the data for cleaning
archive_clean = twitter_archive.copy()
Since no modification of the other dataframes is intended, and assigning the merged result to the copy above won't affect them, there is no need to make duplicates of them in memory.
mask = archive_clean.in_reply_to_status_id.isnull() & archive_clean.retweeted_status_id.isnull()
archive_clean = archive_clean.loc[mask, ]
archive_clean.shape
# test
archive_clean.info()
archive_clean = archive_clean.dropna(axis = 1, how = 'all')
archive_clean.shape
# test
archive_clean.info()
archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)
# test
assert archive_clean.timestamp.dtype == 'datetime64[ns]'
archive_clean.source = archive_clean.source.replace(r'^<a.*?>', '', regex = True)
archive_clean.source = archive_clean.source.replace('</a>', '', regex = True)
archive_clean.source.sample(3)
archive_clean.source.value_counts()
archive_clean.source = archive_clean.source.astype('category')
# test
assert archive_clean.source.dtype == 'category'
archive_clean[archive_clean.expanded_urls.notnull()].expanded_urls.head()
archive_clean.expanded_urls = archive_clean.apply(lambda x:
', '.join(set(x['expanded_urls'].split(',')))
if pd.notnull(x['expanded_urls']) else x['expanded_urls'],
axis = 1)
# test
archive_clean[archive_clean.expanded_urls.notnull()].expanded_urls.head()
archive_clean.iloc[: , -5:] = archive_clean.iloc[: , -5:].replace('None', np.nan)
# test
archive_clean.info()
Check if any names can be extracted from tweets with non-name words in 'name' column and add the proper names, if any.
Replace other non-name words in 'name' column with NaN.
archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name']
if x['name'][0].islower() else "Names",
axis = 1).value_counts()
not_names = archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name']
                                  if x['name'][0].islower() else "Names",
                                  axis = 1).value_counts().index.tolist()[1:]
", ".join(not_names)
for index, row in archive_clean.iterrows():
    if row['name'] in not_names:
        print(index, row['text'])
In some tweets where "non-name" words were extracted, names are present after the words "named" or "name is". These names can be extracted and added to the name column. Other values should be replaced with NaN.
def get_name(x, text):
    """
    Function for extracting dog names from the text field of a tweet,
    if a non-name word was extracted on the previous iteration.
    """
    split_words = ['named ', 'name is ']
    if x is np.nan or x[0].isupper():
        return x
    if split_words[0] in text:
        split_word = split_words[0]
    elif split_words[1] in text:
        split_word = split_words[1]
    else:
        return np.nan
    return text.split(split_word)[1].split(' ')[0].replace('.', '')
# Function test
print(get_name(archive_clean.name[1], archive_clean.text[1]), # Name
get_name(archive_clean.name[1878], archive_clean.text[1878]), # No name in text
get_name(archive_clean.name[2235], archive_clean.text[2235])) # Article instead of name
archive_clean.name = archive_clean.apply(lambda x: get_name(x['name'], x['text']),
axis = 1)
archive_clean.name.value_counts()
# test
assert len(archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name']
if x['name'][0].islower() else "Names",
axis = 1).value_counts().index.tolist()) == 1
archive_clean[['pupper', 'puppo', 'doggo']] = archive_clean[['pupper', 'puppo', 'doggo']].fillna('')
archive_clean[['pupper', 'puppo', 'doggo']].sample(10)
archive_clean['dog_stages'] = archive_clean.pupper.astype(str) + ',' + archive_clean.puppo +',' + archive_clean.doggo
archive_clean.dog_stages = archive_clean.dog_stages.replace(",,", np.nan)
archive_clean.iloc[: , -5:-1] = archive_clean.iloc[: , -5:-1].replace('', np.nan)
archive_clean.iloc[: , -5:].sample(5)
archive_clean.dog_stages = archive_clean.dog_stages.str.strip(",").replace(',,', ',', regex = True)
archive_clean.dog_stages.value_counts()
mask = archive_clean.dog_stages == 'puppo,doggo'
archive_clean[mask].text
As can be seen from the text, the stage should be set to 'puppo'.
pd.options.mode.chained_assignment = None
archive_clean.dog_stages[191] = 'puppo'
archive_clean.dog_stages.value_counts()
mask = archive_clean.dog_stages == 'pupper,doggo'
archive_clean[mask].text
For the indexes above:
460 - no stage
531 - two dogs
575 - pupper
705 - doggo in text, but actually a hedgehog
733 - two dogs
889 - two dogs
956 - doggo in picture
1063 - two dogs
1113 - two dogs
archive_clean.dog_stages[460] = np.nan
archive_clean.dog_stages[575] = 'pupper'
archive_clean.dog_stages[705] = np.nan
archive_clean.dog_stages[956] = 'doggo'
archive_clean.dog_stages.value_counts()
mask = archive_clean.dog_stages == 'doggo'
archive_clean[mask].text
Of the following tweets:
363 This is Astrid. She's a guide doggo in training. 13/10 would follow anywhere https://t.co/xo7FZFIAao
389 This is Pilot. He has mastered the synchronized head tilt and sneaky tongue slip. Usually not unlocked until later doggo days. 12/10 https://t.co/YIV8sw8xkh
992 That is Quizno. This is his beach. He does not tolerate human shenanigans on his beach. 10/10 reclaim ur land doggo https://t.co/vdr7DaRSa7
363 is a pupper, 389 is a puppo and 992 is a horse.
archive_clean.dog_stages[363] = 'pupper'
archive_clean.dog_stages[389] = 'puppo'
archive_clean.dog_stages[992] = np.nan
archive_clean.dog_stages.value_counts()
mask = archive_clean.dog_stages == 'puppo'
archive_clean[mask].text
In these tweets the word "puppo" seems to be meaningful.
mask = archive_clean.dog_stages == 'pupper'
archive_clean[mask].text
# denominators
denom_not_10 = archive_clean.rating_denominator.value_counts().index.tolist()[1:]
mask = archive_clean.rating_denominator.isin(denom_not_10)
archive_clean[mask].text
There are two main types of mistakes.
Of the tweets above:
516 - no rating, should be excluded
1068 - wrong numbers taken for ratings, should be 14/10
1165 - wrong numbers taken for ratings, should be 13/10
1202 - wrong numbers taken for rating, should be 11/10
1662 - wrong numbers taken for rating, should be 10/10
2335 - wrong numbers taken for rating, should be 9/10
In the other tweets, the ratings are "adjusted" by the number of dogs in the picture. Since the ratings will be used in float form, this can be left as is for the later division.
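The arithmetic behind leaving group ratings as-is: after the division planned for the rating column, a hypothetical multi-dog rating such as 84/70 collapses to the same float as a per-dog 12/10.

```python
# Hypothetical group rating (seven dogs at 12/10 each) vs a single 12/10.
group_rating = 84 / 70     # seven dogs, rating "adjusted" by dog count
single_rating = 12 / 10    # one dog
print(group_rating, single_rating)
```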
archive_clean = archive_clean.drop(516)
archive_clean.rating_numerator[1068] = 14
archive_clean.rating_denominator[1068] = 10
archive_clean.rating_numerator[1165] = 13
archive_clean.rating_denominator[1165] = 10
archive_clean.rating_numerator[1202] = 11
archive_clean.rating_denominator[1202] = 10
archive_clean.rating_numerator[1662] = 10
archive_clean.rating_denominator[1662] = 10
archive_clean.rating_numerator[2335] = 9
archive_clean.rating_denominator[2335] = 10
archive_clean.rating_denominator.value_counts()
checked_denominators = archive_clean.rating_denominator.value_counts().index.tolist()[1:]
mask = ~archive_clean.rating_denominator.isin(checked_denominators)
# numerators
archive_clean[mask].rating_numerator.value_counts()
mask_num = (archive_clean[mask].rating_numerator > 14)
archive_clean.loc[mask_num[mask_num == True].index, :][['tweet_id', 'text']]
There are several tweets where the ratings are not in the typical form because of special occasions, like Christmas. The last three tweets may be dropped. Also, not all numerators seem to be integers; it may be useful to check the texts for halves.
archive_clean = archive_clean.drop([979, 1712, 2074])
archive_clean.rating_numerator = archive_clean.rating_numerator.astype(float)
archive_clean.rating_numerator[695] = 9.75
archive_clean.rating_numerator[763] = 11.27
mask = archive_clean.rating_numerator == 5
archive_clean[mask].text
archive_clean.rating_numerator[45] = 13.5
archive_clean.rating_numerator.value_counts()
archive_clean['rating'] = archive_clean.rating_numerator / archive_clean.rating_denominator
archive_clean.rating.describe()
archive_clean = archive_clean[['tweet_id', 'timestamp', 'source',
'text', 'expanded_urls', 'name',
'floofer', 'dog_stages', 'rating']]
archive_clean.info()
twitter_archive_master = archive_clean.merge(image_predictions, on = 'tweet_id', suffixes = ('', '_imp'))
twitter_archive_master.info()
twitter_archive_master = twitter_archive_master.merge(tweet_jsons, on = 'tweet_id', suffixes = ('', '_jsons'))
twitter_archive_master.info()
# writing cleaned data to csv file
twitter_archive_master.to_csv('twitter_archive_master.csv', index = False)
A separate text-only report on the data analysis, in HTML format, was created with R Markdown; it includes somewhat fewer findings than the code here, as it was becoming too long. See it here.
# setting up graphics
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
# loading cleaned data
df = pd.read_csv('twitter_archive_master.csv')
# fixing types
df['timestamp'] = pd.to_datetime(df.timestamp)
df['dog_stages'] = df.dog_stages.astype('category')
df['source'] = df.source.astype('category')
df = df.set_index('timestamp')
df.info()
Ok, Python. Who is the most favorited dog of all time at @dog_rates? At least in this dataset.
top_dog = df.loc[df.favorite_count.idxmax(), : ]
print("Tweet:", top_dog.text + "\n",
"Favorite count: ", str(top_dog.favorite_count) + "\n",
"Retweet_count:", top_dog.retweet_count)
from IPython.display import Image
from IPython.core.display import HTML
Image(url = top_dog.jpg_url)
It is actually a video. And maybe you should take a look, too. But I guess I'm not the first to suggest that. By the way, the lowest rating went to a screenshot of another Twitter account, posted to call out plagiarism. Do you agree?
df.loc[df.rating.idxmin(), ].text
Ok, let's be a bit more serious. The cleaned dataset consists of 1965 rows and 22 variables, including data from the WeRateDogs Twitter archive, additional Twitter data gathered via the API, and dog breed predictions made by a neural network.
df.describe()
As can be seen from the summary statistics on favorites, with a mean favorite count of about 8741, our top dog is a real outlier. The same is true for retweets: the mean is about 2651. The distributions seem noticeably right-skewed; we can check that with histograms.
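Skew can also be quantified directly with Series.skew; a toy series stands in for df.favorite_count here:

```python
import pandas as pd

# A long right tail (one large value) gives positive sample skewness.
toy = pd.Series([1, 1, 2, 2, 3, 50])
skewness = toy.skew()   # positive => right-skewed
print(skewness)
```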
df.favorite_count.hist(bins = 100);
df.retweet_count.hist(bins = 100);
timestamp = df.index
plt.hist(timestamp, bins = 100);
As can be seen from the plot above, WeRateDogs put a lot of effort into promoting the account, posting quite frequently during the first months. We can see whether it paid off with the mean retweet and favorite counts per month.
plot = df.groupby([df.index.year, df.index.month]).retweet_count.mean().plot()
plot.set(xlabel = 'Time', ylabel = 'Count', title = 'Mean Retweet Count Per Month');
plot = df.groupby([df.index.year, df.index.month]).favorite_count.mean().plot()
plot.set(xlabel = 'Time', ylabel = 'Count', title = 'Mean Favorite Count Per Month');
df.rating.mean(), df.rating.median()
The median rating is 11/10 and the interquartile range is between 10/10 and 12/10.
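The quoted median and quartiles come straight from Series.quantile (ratings are stored as floats, so 11/10 reads as 1.1); a self-contained sketch with a toy ratings column in place of df.rating:

```python
import pandas as pd

# Toy stand-in for df.rating.
ratings = pd.Series([1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3])
quartiles = ratings.quantile([0.25, 0.5, 0.75])
print(quartiles.loc[0.5])
```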
df.rating.hist(bins = 20);
But it seems that a dog doesn't need the highest possible rating to be the most popular: the highest favorite and retweet counts are in the 13/10 group (see the plots below). Maybe 14/10 is too subjective?
plot = df.plot.scatter(x = 'rating', y = 'favorite_count')
plot.set(xlabel = 'Rating', ylabel = 'Favorites', title = 'Favorites vs Rating');
plot = df.plot.scatter(x = 'rating', y = 'retweet_count')
plot.set(xlabel = 'Rating', ylabel = 'Retweets', title = 'Retweets vs Rating');
plot = df.plot.scatter(x = 'retweet_count', y = 'favorite_count')
plot.set(xlabel = 'Retweets', ylabel = 'Favorites', title = 'Favoriting & Retweeting');
The more retweets, the more likes. Did you expect that? Or should it be the other way around?
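The trend in the scatter plot can be pinned down with a correlation coefficient; a toy frame stands in for the real columns:

```python
import pandas as pd

# Roughly linear toy data: favorites grow with retweets.
toy = pd.DataFrame({'retweet_count': [100, 200, 400, 800],
                    'favorite_count': [350, 640, 1300, 2500]})
corr = toy['retweet_count'].corr(toy['favorite_count'])   # Pearson by default
print(corr)
```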
plot = df.boxplot(column = 'rating', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Rating', title = 'Rating By Dog Stages');
# This cell doesn't produce any warnings on my local machine. See the act_report.html.
# It seems like Project Workspace needs some upgrade )
If you like puppies, I may have some bad news for you: their cuteness seems to win them, on average, a lower rating than the other stages get. The following two boxplots, on favorites and retweets, show the same tendency.
plot = df.boxplot(column = 'favorite_count', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Favorites', title = 'Favorites By Dog Stages');
plot = df.boxplot(column = 'retweet_count', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Retweets', title = 'Retweets By Dog Stages');
df.dog_stages.value_counts()
It's a pity that there is not enough data to judge whether a dog-and-pup pair really does better on average than the others. But we can use our subjective expert opinion here. Aren't they great?
parents = list(df[df.dog_stages == 'pupper,doggo'].jpg_url)
from skimage import io
imgs = []
for pair in parents:
    imgs.append(io.imread(pair))
plt.figure(figsize = (20, 5))
columns = 4
for i, img in enumerate(imgs):
    plt.subplot(len(imgs) // columns + 1, columns, i + 1)
    plt.imshow(img)