Data Wrangling Project

by Tatiana Kurilo

Data gathering, assessing and cleaning stages are documented in wrangle_report.html.
Parts of the data analysis and visualisations are presented in a more "reader-friendly" way in act_report.html.

Data Gathering

In [1]:
# imports

import os
import time
import requests
import pandas as pd
import tweepy
import json
import numpy as np

Loading Twitter Archive Data Locally

I downloaded twitter-archive-enhanced.csv, uploaded it to the Project Workspace on Udacity, and read it into the dataframe twitter_archive with pandas.

In [2]:
# loading twitter archive data

twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head(1)
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None

Downloading Image Prediction Data Programmatically

I wrote code to download image_predictions.tsv directly to the Project Workspace and read it into the dataframe image_predictions with pandas.

In [3]:
# getting image prediction data file

image_prediction_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(image_prediction_url)

with open("image_predictions.tsv", mode='wb') as file:
    file.write(r.content)
In [4]:
# loading image prediction data

image_predictions = pd.read_csv('image_predictions.tsv', sep = '\t')
image_predictions.head(1)
Out[4]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True

Gathering Information Via Twitter API

I wrote a script to get Twitter JSON data via the API with the Tweepy library, using the list of tweet IDs from the twitter_archive dataframe, and saved the result to the file tweet_json.txt. I uploaded this file to the Project Workspace and added the script to the project notebook without authentication keys. Since running it without keys would cause errors, I commented out the cell that contains the code. I then read the data from tweet_json.txt into the dataframe tweet_jsons, using the json and pandas libraries.

In [5]:
# Twitter data gathering script. Uncomment and add your keys to run.

#tokens = {"consumer_key": "",
#         "consumer_secret": "",
#         "oauth_token": "",
#         "oauth_token_secret": ""}
#
#consumer_key = tokens["consumer_key"]
#consumer_secret = tokens["consumer_secret"]
#oauth_token = tokens["oauth_token"]
#oauth_token_secret = tokens["oauth_token_secret"]
#
#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(oauth_token, oauth_token_secret)
# wait_on_rate_limit makes Tweepy wait when the API rate limit is reached
#api = tweepy.API(auth, wait_on_rate_limit = True)
#
#filename = 'tweet_json.txt'
#
#try:
#    os.remove(filename)
#except OSError:
#    pass
#
#tweet_errors = {}
#count = 0
#
#with open(filename, 'a') as f:
#    for tweet_id in twitter_archive['tweet_id']:
#        try:
#            tweet = api.get_status(tweet_id, tweet_mode='extended')
#            json.dump(tweet._json, f)
#            f.write('\n')
#            count += 1
#        except tweepy.TweepError as e:
#            print(tweet_id, e.args[0][0]['message'])
#            tweet_errors[tweet_id] = e.reason
#        time.sleep(1.2)
#        if count % 100 == 0:
#            print(count)
#
#print("Errors:", tweet_errors)
#print("Count:", count)
In [6]:
# script output: count

print("Count: 2340")
Count: 2340
In [7]:
# script output: errors

errors = {888202515573088257: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          873697596434513921: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          872668790621863937: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          869988702071779329: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          866816280283807744: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          861769973181624320: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          845459076796616705: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          842892208864923648: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          837012587749474308: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          827228250799742977: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          812747805718642688: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          802247111496568832: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          775096608509886464: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          770743923962707968: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          754011816964026368: "[{'code': 144, 'message': 'No status found with that ID.'}]", 
          680055455951884288: "[{'code': 144, 'message': 'No status found with that ID.'}]"}

len(errors)
Out[7]:
16

Sixteen tweets from the original Twitter archive data are no longer available online. For the other 2340 tweets, the additional information on likes and retweets was gathered successfully.

In [8]:
# reading JSON data from the text file

json_list = []

with open('tweet_json.txt') as f:
    for line in f:
        a_json = json.loads(line)
        json_list.append({'tweet_id': a_json['id'],
                          'favorite_count': a_json['favorite_count'],
                          'retweet_count': a_json['retweet_count']})
    
tweet_jsons = pd.DataFrame(json_list)
tweet_jsons.head()
Out[8]:
favorite_count retweet_count tweet_id
0 37855 8260 892420643555336193
1 32526 6104 892177421306343426
2 24492 4041 891815181378084864
3 41205 8410 891689557279858688
4 39384 9109 891327558926688256
In [9]:
# rearranging columns

tweet_jsons = tweet_jsons[['tweet_id', 'favorite_count', 'retweet_count']]
tweet_jsons.head()
Out[9]:
tweet_id favorite_count retweet_count
0 892420643555336193 37855 8260
1 892177421306343426 32526 6104
2 891815181378084864 24492 4041
3 891689557279858688 41205 8410
4 891327558926688256 39384 9109

Data Assessing

WeRateDogs Twitter Archive Data

In [10]:
twitter_archive.shape
Out[10]:
(2356, 17)
In [11]:
twitter_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

There are 17 variables in the twitter_archive dataframe: the first 10 come from the original Twitter data, and 7 were added later, based mostly on the content of the tweets. The timestamp columns have the wrong data type (object rather than datetime), as the output above shows. There are also non-null values in the columns that indicate retweets and replies. Retweets must be excluded under the project guidelines; replies need to be assessed further.
Since a dog cannot be in all stages simultaneously, we can assume that in the dog stage columns negative or missing options are encoded with strings rather than NaN.
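The timestamp issue noted above could be addressed along these lines. This is a minimal sketch on two rows copied from the head() output, not the cleaning code used later in the project; the dataframe name sample is a placeholder introduced here.

```python
import pandas as pd

# Two rows copied from the head() output; the real dataframe has 2356 rows.
sample = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# Parse the strings into timezone-aware datetimes (UTC).
sample["timestamp"] = pd.to_datetime(sample["timestamp"], utc=True)

print(sample.dtypes)
```

After the conversion, the .dt accessor (for example sample["timestamp"].dt.year) becomes available for time-based analysis.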

In [12]:
twitter_archive.head()
Out[12]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
In [13]:
twitter_archive.tail()
Out[13]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
2351 666049248165822465 NaN NaN 2015-11-16 00:24:50 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a 1949 1st generation vulpix. Enj... NaN NaN NaN https://twitter.com/dog_rates/status/666049248... 5 10 None None None None None
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52 +0000 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... NaN NaN NaN https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54 +0000 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... NaN NaN NaN https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30 +0000 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... NaN NaN NaN https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None
2355 666020888022790149 NaN NaN 2015-11-15 22:32:08 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a Japanese Irish Setter. Lost eye... NaN NaN NaN https://twitter.com/dog_rates/status/666020888... 8 10 None None None None None
In [14]:
twitter_archive.sample(10)
Out[14]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
773 776249906839351296 NaN NaN 2016-09-15 02:42:54 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: We only rate dogs. Pls stop sen... 7.007478e+17 4.196984e+09 2016-02-19 18:24:26 +0000 https://twitter.com/dog_rates/status/700747788... 11 10 very None None None None
1449 696100768806522880 NaN NaN 2016-02-06 22:38:50 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This poor pupper has been stuck in a vortex si... NaN NaN NaN https://vine.co/v/i1KWj0vbvA9 10 10 None None None pupper None
2135 670061506722140161 NaN NaN 2015-11-27 02:08:07 +0000 <a href="http://twitter.com/download/iphone" r... This is Liam. He has a particular set of skill... NaN NaN NaN https://twitter.com/dog_rates/status/670061506... 11 10 Liam None None None None
2232 668221241640230912 NaN NaN 2015-11-22 00:15:33 +0000 <a href="http://twitter.com/download/iphone" r... These two dogs are Bo &amp; Smittens. Smittens... NaN NaN NaN https://twitter.com/dog_rates/status/668221241... 10 10 None None None None None
2149 669684865554620416 6.693544e+17 4.196984e+09 2015-11-26 01:11:28 +0000 <a href="http://twitter.com/download/iphone" r... After countless hours of research and hundreds... NaN NaN NaN NaN 11 10 None None None None None
1017 746872823977771008 NaN NaN 2016-06-26 01:08:52 +0000 <a href="http://twitter.com/download/iphone" r... This is a carrot. We only rate dogs. Please on... NaN NaN NaN https://twitter.com/dog_rates/status/746872823... 11 10 a None None None None
1595 686358356425093120 NaN NaN 2016-01-11 01:25:58 +0000 <a href="http://twitter.com/download/iphone" r... Heartwarming scene here. Son reuniting w fathe... NaN NaN NaN https://twitter.com/dog_rates/status/686358356... 10 10 None None None None None
1616 685198997565345792 NaN NaN 2016-01-07 20:39:06 +0000 <a href="http://twitter.com/download/iphone" r... This is Alfie. That is his time machine. He's ... NaN NaN NaN https://twitter.com/dog_rates/status/685198997... 11 10 Alfie None None None None
2300 667062181243039745 NaN NaN 2015-11-18 19:29:52 +0000 <a href="http://twitter.com/download/iphone" r... This is Keet. He is a Floridian Amukamara. Abs... NaN NaN NaN https://twitter.com/dog_rates/status/667062181... 10 10 Keet None None None None
1086 738166403467907072 NaN NaN 2016-06-02 00:32:39 +0000 <a href="http://twitter.com/download/iphone" r... This is Axel. He's a professional leaf catcher... NaN NaN NaN https://twitter.com/dog_rates/status/738166403... 12 10 Axel None None None None
In [15]:
twitter_archive.source.value_counts()
Out[15]:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

The source variable can be converted to the category type, since it has a limited number of values. However, the HTML markup should be stripped for readability.
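One possible way to strip the markup and convert the type is sketched below. The regular expression assumes every value has the `<a ...>label</a>` shape seen in the value counts above; the names sources and clean are placeholders.

```python
import pandas as pd

# Three of the four source values shown in the value counts above.
sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between the opening <a ...> tag and the closing </a>,
# then store the result as a categorical variable.
clean = sources.str.extract(r">([^<]+)</a>", expand=False).astype("category")

print(clean.tolist())
```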

In [16]:
pd.set_option('display.max_colwidth', -1)

twitter_archive[twitter_archive.expanded_urls.notnull()][['tweet_id', 'expanded_urls']].sample(10)
Out[16]:
tweet_id expanded_urls
1452 695767669421768709 https://twitter.com/dog_rates/status/695767669421768709/photo/1
1526 690374419777196032 https://twitter.com/dog_rates/status/690374419777196032/photo/1
1377 701601587219795968 https://twitter.com/dog_rates/status/701601587219795968/photo/1
871 761599872357261312 https://twitter.com/dog_rates/status/761599872357261312/photo/1
343 832040443403784192 https://twitter.com/dog_rates/status/769940425801170949/photo/1,https://twitter.com/dog_rates/status/769940425801170949/photo/1,https://twitter.com/dog_rates/status/769940425801170949/photo/1,https://twitter.com/dog_rates/status/769940425801170949/photo/1
2307 666826780179869698 https://twitter.com/dog_rates/status/666826780179869698/photo/1
1588 686730991906516992 https://twitter.com/dog_rates/status/686730991906516992/photo/1
1040 744223424764059648 https://twitter.com/strange_animals/status/672108316018024452
990 748705597323898880 https://twitter.com/dog_rates/status/748705597323898880/video/1
1015 747103485104099331 https://twitter.com/dog_rates/status/747103485104099331/photo/1,https://twitter.com/dog_rates/status/747103485104099331/photo/1,https://twitter.com/dog_rates/status/747103485104099331/photo/1,https://twitter.com/dog_rates/status/747103485104099331/photo/1

As can be seen from the table above, some tweets have duplicated URLs in the expanded_urls column, which may come from the entities and extended_entities JSON fields of the original archive data.
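A small helper can illustrate how such duplicates might be collapsed. It is a sketch that assumes URLs within a cell are separated by commas and contain no commas themselves; unique_urls is a hypothetical name, and NaN cells would have to be skipped before applying it to the column.

```python
def unique_urls(cell):
    """Return the comma-separated URLs in `cell` without repeats, keeping order."""
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return ",".join(dict.fromkeys(cell.split(",")))


# A shortened version of the duplicated cell shown for tweet 747103485104099331.
cell = ("https://twitter.com/dog_rates/status/747103485104099331/photo/1,"
        "https://twitter.com/dog_rates/status/747103485104099331/photo/1")

print(unique_urls(cell))
```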

In [17]:
twitter_archive.rating_denominator.value_counts()
Out[17]:
10     2333
11     3   
50     3   
80     2   
20     2   
2      1   
16     1   
40     1   
70     1   
15     1   
90     1   
110    1   
120    1   
130    1   
150    1   
170    1   
7      1   
0      1   
Name: rating_denominator, dtype: int64
In [18]:
twitter_archive.rating_numerator.value_counts().sort_index()
Out[18]:
0       2  
1       9  
2       9  
3       19 
4       17 
5       37 
6       32 
7       55 
8       102
9       158
10      461
11      464
12      558
13      351
14      54 
15      2  
17      1  
20      1  
24      1  
26      1  
27      1  
44      1  
45      1  
50      1  
60      1  
75      2  
80      1  
84      1  
88      1  
99      1  
121     1  
143     1  
144     1  
165     1  
182     1  
204     1  
420     2  
666     1  
960     1  
1776    1  
Name: rating_numerator, dtype: int64

Though in general a rating is expected to be in M/N format, where N is 10 and M is below or slightly above 10, some numbers in these two columns don't fit this pattern. They will require further investigation during cleaning. These two columns should also be combined into one calculated rating column for use in further analysis.
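The calculation of a single rating value might look like the sketch below. The three numerator/denominator pairs are taken from tweet texts shown in this notebook (13/10, 143/130, 960/0); treating a zero denominator as missing is an assumption made here, not a decision documented by the project.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "rating_numerator": [13, 143, 960],
    "rating_denominator": [10, 130, 0],
})

# Treat a zero denominator as missing so the division yields NaN, not inf.
denominator = ratings["rating_denominator"].replace(0, np.nan)
ratings["rating"] = ratings["rating_numerator"] / denominator

print(ratings)
```

Dividing by the denominator puts group ratings such as 143/130 on the same scale as ordinary M/10 ratings.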

In [19]:
pd.set_option('display.max_colwidth', -1)

twitter_archive[twitter_archive.in_reply_to_status_id.notnull()][["rating_numerator", "rating_denominator", "text"]]
Out[19]:
rating_numerator rating_denominator text
30 12 10 @NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution
55 17 10 @roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s
64 14 10 @RealKentMurphy 14/10 confirmed
113 10 10 @ComplicitOwl @ShopWeRateDogs &gt;10/10 is reserved for dogs
148 12 10 @Jack_Septic_Eye I'd need a few more pics to polish a full analysis, but based on the good boy content above I'm leaning towards 12/10
149 14 10 Ladies and gentlemen... I found Pipsy. He may have changed his name to Pablo, but he never changed his love for the sea. Pupgraded to 14/10 https://t.co/lVU5GyNFen
179 12 10 @Marc_IRL pixelated af 12/10
184 14 10 THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY HI AFTER ALL. PUPGRADED TO A 14/10. WOULD BE AN HONOR TO FLY WITH https://t.co/p1hBHCmWnA
186 14 10 @xianmcguire @Jenna_Marbles Kardashians wouldn't be famous if as a society we didn't place enormous value on what they do. The dogs are very deserving of their 14/10
188 420 10 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
189 666 10 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
218 13 10 @markhoppus MARK THAT DOG HAS SEEN AND EXPERIENCED MANY THINGS. PROBABLY LOST OTHER EAR DOING SOMETHING HEROIC. 13/10 HUG THE DOG HOPPUS
228 11 10 Jerry just apuppologized to me. He said there was no ill-intent to the slippage. I overreacted I admit. Pupgraded to an 11/10 would pet
234 13 10 .@breaannanicolee PUPDATE: Cannon has a heart on his nose. Pupgraded to a 13/10
251 13 10 PUPDATE: I'm proud to announce that Toby is 236 days sober. Pupgraded to a 13/10. We're all very proud of you, Toby https://t.co/a5OaJeRl9B
274 10 10 @0_kelvin_0 &gt;10/10 is reserved for puppos sorry Kevin
290 182 10 @markhoppus 182/10
291 15 10 @bragg6of8 @Andy_Pace_ we are still looking for the first 15/10
313 960 0 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
342 11 15 @docmisterio account started on 11/15/15
346 12 10 @UNC can confirm 12/10
387 7 10 I was going to do 007/10, but the joke wasn't worth the &lt;10 rating
409 13 10 @HistoryInPics 13/10
427 13 10 @imgur for a polar bear tho I'd say 13/10 is appropriate
498 12 10 I've been informed by multiple sources that this is actually a dog elf who's tired from helping Santa all night. Pupgraded to 12/10
513 11 10 PUPDATE: I've been informed that Augie was actually bringing his family these flowers when he tripped. Very good boy. Pupgraded to 11/10
565 11 10 Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze
570 11 10 .@NBCSports OMG THE TINY HAT I'M GOING TO HAVE TO SAY 11/10 NBC
576 11 10 @SkyWilliams doggo simply protecting you from evil that which you cannot see. 11/10 would give extra pets
611 11 10 @JODYHiGHROLLER it may be an 11/10 but what do I know 😉
... ... ... ...
1479 11 10 Personally I'd give him an 11/10. Not sure why you think you're qualified to rate such a stellar pup.\n@CommonWhiteGirI
1497 9 10 PUPDATE: just noticed this dog has some extra legs. Very advanced. Revolutionary af. Upgraded to a 9/10
1501 13 10 These are some pictures of Teddy that further justify his 13/10 rating. Please enjoy https://t.co/tDkJAnQsbQ
1523 12 10 12/10 @LightningHoltt
1598 4 20 Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating
1605 14 10 Jack deserves another round of applause. If you missed this earlier today I strongly suggest reading it. Wonderful first 14/10 🐶❤️
1618 5 10 For those who claim this is a goat, u are wrong. It is not the Greatest Of All Time. The rating of 5/10 should have made that clear. Thank u
1630 12 10 After watching this video, we've determined that Pippa will be upgraded to a 12/10. Please enjoy https://t.co/IKoRK4yoxV
1634 143 130 Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
1663 20 16 I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible
1689 5 10 I've been told there's a slight possibility he's checking his mirror. We'll bump to 9.5/10. Still a menace
1774 13 10 After getting lost in Reese's eyes for several minutes we're going to upgrade him to a 13/10
1819 7 10 After some outrage from the crowd. Bubbles is being upgraded to a 7/10. That's as high as I'm going. Thank you
1842 11 10 &amp; this is Yoshi. Another world record contender 11/10 (what the hell is happening why are there so many contenders?) https://t.co/QG708dDNH6
1844 9 10 This dog is being demoted to a 9/10 for not wearing a helmet while riding. Gotta stay safe out there. Thank you
1852 11 10 We've got ourselves a battle here. Watch out Reggie. 11/10 https://t.co/ALJvbtcwf0
1866 13 10 Yea I lied. Here's more. All 13/10 https://t.co/ZQZf2U4xCP
1882 13 10 Ok last one of these. I may try to make some myself. Anyway here ya go. 13/10 https://t.co/i9CDd1oEu8
1885 13 10 I have found another. 13/10 https://t.co/HwroPYv8pY
1892 12 10 Just received another perfect photo of dogs and the sunset. 12/10 https://t.co/9YmNcxA2Cc
1895 11 10 Some clarification is required. The dog is singing Cher and that is more than worthy of an 11/10. Thank you
1905 13 10 The 13/10 also takes into account this impeccable yard. Louis is great but the future dad in me can't ignore that luscious green grass
1914 13 10 13/10\n@ABC7
1940 1 10 The millennials have spoken and we've decided to immediately demote to a 1/10. Thank you
2036 13 10 I'm just going to leave this one here as well. 13/10 https://t.co/DaD5SyajWt
2038 1 10 After 22 minutes of careful deliberation this dog is being demoted to a 1/10. The longer you look at him the more terrifying he becomes
2149 11 10 After countless hours of research and hundreds of formula alterations we have concluded that Dug should be bumped to an 11/10
2169 10 10 This is Tessa. She is also very pleased after finally meeting her biological father. 10/10 https://t.co/qDS1aCqppv
2189 12 10 12/10 good shit Bubka\n@wane15
2298 10 10 After much debate this dog is being upgraded to 10/10. I repeat 10/10

78 rows × 3 columns

Replies sometimes lack images, sometimes only add a comment to an original @dog_rates tweet, and sometimes are not about dogs at all. For consistency, it may be useful to exclude replies together with retweets.
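Excluding replies and retweets could be done with a boolean mask, as in this sketch. The tweet_id values are placeholders; the column names match twitter_archive, and the two float IDs are copied from the sample output above.

```python
import numpy as np
import pandas as pd

tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],  # placeholder ids
    "in_reply_to_status_id": [np.nan, 6.693544e17, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 7.007478e17],
})

# An original tweet has neither a reply target nor a retweeted status.
is_original = (tweets["in_reply_to_status_id"].isnull()
               & tweets["retweeted_status_id"].isnull())
originals = tweets[is_original]

print(originals)
```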

In [20]:
twitter_archive.doggo.value_counts()
Out[20]:
None     2259
doggo    97  
Name: doggo, dtype: int64
In [21]:
twitter_archive.pupper.value_counts()
Out[21]:
None      2099
pupper    257 
Name: pupper, dtype: int64
In [22]:
twitter_archive.puppo.value_counts()
Out[22]:
None     2326
puppo    30  
Name: puppo, dtype: int64
In [23]:
twitter_archive.floofer.value_counts()
Out[23]:
None       2346
floofer    10  
Name: floofer, dtype: int64

As can be seen from the output above, the missing values are encoded with "None" in string format. Also, the pupper, puppo and doggo columns may be combined into one dog_stages column and used as an ordinal categorical variable with three levels.
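A rough sketch of combining the three stage columns follows. The helper combine_stages and the new column name dog_stage are hypothetical, and joining several stages with a comma is just one way to handle tweets that mention more than one stage.

```python
import numpy as np
import pandas as pd

stages = pd.DataFrame({
    "doggo":  ["None", "doggo", "doggo", "None"],
    "pupper": ["pupper", "None", "pupper", "None"],
    "puppo":  ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join every stage found in the row; return NaN when there is none."""
    found = [value for value in row if value != "None"]
    return ",".join(found) if found else np.nan


stages["dog_stage"] = stages[["doggo", "pupper", "puppo"]].apply(combine_stages, axis=1)

print(stages["dog_stage"].tolist())
```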

In [24]:
twitter_archive[twitter_archive.name.notnull()].apply(lambda x: x['name'] 
                                                  if x['name'][0].islower() else "Names", 
                                                  axis = 1).value_counts()
Out[24]:
Names           2247
a               55  
the             8   
an              7   
very            5   
quite           4   
just            4   
one             4   
actually        2   
getting         2   
mad             2   
not             2   
officially      1   
my              1   
this            1   
old             1   
infuriating     1   
his             1   
such            1   
incredibly      1   
unacceptable    1   
space           1   
all             1   
life            1   
by              1   
light           1   
dtype: int64

The name column contains many non-name words that were extracted from the text and should be excluded. Some tweets may still contain a name in the text, just not where the extraction expected it. This will require further investigation during data cleaning.
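The lowercase-first-letter check used in the cell above can also drive the replacement of non-names with missing values. This is a minimal sketch on a handful of values from the output; it does not attempt to recover names that appear elsewhere in the tweet text.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Phineas", "a", "None", "very", "Tilly"])

# "None" placeholders and words starting with a lowercase letter are not names.
not_a_name = (names == "None") | names.str[0].str.islower()
cleaned = names.mask(not_a_name, np.nan)

print(cleaned.tolist())
```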

Quality Issues In Twitter Archive Data

  1. 78 tweets are replies and can't be counted as tweets of the "standard format" (an image, a text presenting the dog in the image and a rating), as they often lack some of this information.
  2. 181 tweets have non-null values in retweeted_status_id and retweeted_status_user_id, which means that these tweets are actually retweets; the project guidelines exclude retweets.
  3. Missing values in the dog stage columns, floofer and the name column are encoded with "None" strings rather than pandas NaN values.
  4. Values in the timestamp and retweeted_status_timestamp columns are not in datetime format.
  5. Values in the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns are floats shown in scientific notation. If retweets and replies are removed, no further action is needed: these columns, together with retweeted_status_timestamp, will contain only null values and can be dropped.
  6. Some rating numerators are too large for the "M/10" pattern: it is fine for M to exceed 10 by a few points, but not by several times. Some are unexpectedly low.
  7. Some rating denominators are not equal to 10.
  8. The name column contains articles and other "non-name" words.
  9. Values in the source column are links with HTML wrapped around the actual content, which hurts readability. Also, the column type should be category.
  10. Duplicated URLs within the same cells of the expanded_urls column.

Tidiness Issues In Twitter Archive Data

  1. Dog stage columns - pupper, puppo and doggo - may be combined into one as the levels of one categorical variable. Still, tweets picturing several dogs may carry more than one stage value.
  2. Rating columns - rating_numerator and rating_denominator - should be used to calculate one rating value in float format to be used in analysis.

Image Prediction Data

In [25]:
image_predictions.shape
Out[25]:
(2075, 12)
In [26]:
image_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [27]:
image_predictions.head()
Out[27]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [28]:
image_predictions.tail()
Out[28]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False
In [29]:
image_predictions.sample(10)
Out[29]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
656 682259524040966145 https://pbs.twimg.com/media/CXffar9WYAArfpw.jpg 1 Siberian_husky 0.439670 True Eskimo_dog 0.340474 True malamute 0.101253 True
916 701545186879471618 https://pbs.twimg.com/media/CbxjnyOWAAAWLUH.jpg 1 Border_collie 0.280893 True Cardigan 0.112550 True toy_terrier 0.053317 True
1034 711732680602345472 https://pbs.twimg.com/media/CeCVGEbUYAASeY4.jpg 3 dingo 0.366875 False Ibizan_hound 0.334929 True Eskimo_dog 0.073876 True
344 672267570918129665 https://pbs.twimg.com/media/CVRfyZxWUAAFIQR.jpg 1 Irish_terrier 0.716932 True miniature_pinscher 0.051234 True Airedale 0.044381 True
782 690005060500217858 https://pbs.twimg.com/media/CZNj8N-WQAMXASZ.jpg 1 Samoyed 0.270287 True Great_Pyrenees 0.114027 True teddy 0.072475 False
1735 821765923262631936 https://pbs.twimg.com/media/C2d_vnHWEAE9phX.jpg 1 golden_retriever 0.980071 True Labrador_retriever 0.008758 True Saluki 0.001806 True
698 684567543613382656 https://pbs.twimg.com/media/CYASi6FWQAEQMW2.jpg 1 minibus 0.401942 False llama 0.229145 False seat_belt 0.209393 False
1245 747512671126323200 https://pbs.twimg.com/media/Cl-yykwWkAAqUCE.jpg 1 Cardigan 0.111493 True malinois 0.095089 True German_shepherd 0.080146 True
1737 821886076407029760 https://pbs.twimg.com/media/C2ftAxnWIAEUdAR.jpg 1 golden_retriever 0.266238 True cocker_spaniel 0.223325 True Irish_setter 0.151631 True
1125 727314416056803329 https://pbs.twimg.com/media/Chfwmd9U4AQTf1b.jpg 2 toy_poodle 0.827469 True miniature_poodle 0.160760 True Tibetan_terrier 0.001731 True
In [30]:
twitter_archive.shape[0] - image_predictions.shape[0]
Out[30]:
281
Quality Issues in Image Prediction Data
  • Missing information for 281 tweets in Twitter Archive Data.
  • Underscores in predictions may be changed to spaces for readability.
  • Predictions may be changed to category type.
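
The two cosmetic fixes listed above can be sketched on a toy frame. This is a minimal illustration, not the cleaning code used later in the project: `preds` is a hypothetical stand-in that holds a few prediction labels copied from the samples above.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of image_predictions
# (labels copied from the samples shown above).
preds = pd.DataFrame({
    'p1': ['Siberian_husky', 'golden_retriever', 'golden_retriever'],
    'p2': ['Eskimo_dog', 'Labrador_retriever', 'cocker_spaniel'],
    'p3': ['malamute', 'Saluki', 'Irish_setter'],
})

for col in ['p1', 'p2', 'p3']:
    # Replace underscores first, then convert, so that the
    # categories themselves hold the readable labels.
    preds[col] = preds[col].str.replace('_', ' ', regex=False).astype('category')

print(preds.dtypes)
print(preds.p1.cat.categories.tolist())
```

Doing the replacement before the conversion avoids having to rename the categories afterwards.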

In some cases it may also be reasonable to combine the predictions into three columns:

Number Of Prediction | Dog Breed | Confidence

but for the purpose of this project, where such a change would repeat each tweet across several rows, it seems unreasonable.
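
To make the trade-off concrete, here is a rough sketch of that long layout, built from the single row shown at the top of the notebook. The column names `prediction_number`, `dog_breed` and `confidence` are assumptions chosen for illustration, and the `p*_dog` flags are left out for brevity.

```python
import pandas as pd

# One row in the current wide layout (values from the head() output above).
wide = pd.DataFrame({
    'tweet_id': [666020888022790149],
    'p1': ['Welsh_springer_spaniel'], 'p1_conf': [0.465074],
    'p2': ['collie'], 'p2_conf': [0.156665],
    'p3': ['Shetland_sheepdog'], 'p3_conf': [0.061428],
})

parts = []
for n in (1, 2, 3):
    # Take the n-th prediction and give it the shared column names.
    part = wide[['tweet_id', f'p{n}', f'p{n}_conf']].copy()
    part.columns = ['tweet_id', 'dog_breed', 'confidence']
    part.insert(1, 'prediction_number', n)
    parts.append(part)

long_preds = pd.concat(parts, ignore_index=True)
print(long_preds)
```

Each tweet now occupies three rows, which is exactly the duplication that makes this layout inconvenient once the table is joined with the archive data.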


Additional Twitter Data

In [31]:
tweet_jsons.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 3 columns):
tweet_id          2340 non-null int64
favorite_count    2340 non-null int64
retweet_count     2340 non-null int64
dtypes: int64(3)
memory usage: 54.9 KB
In [32]:
tweet_jsons.head(20)
Out[32]:
tweet_id favorite_count retweet_count
0 892420643555336193 37855 8260
1 892177421306343426 32526 6104
2 891815181378084864 24492 4041
3 891689557279858688 41205 8410
4 891327558926688256 39384 9109
5 891087950875897856 19800 3027
6 890971913173991426 11572 2001
7 890729181411237888 63873 18344
8 890609185150312448 27212 4160
9 890240255349198849 31203 7176
10 890006608113172480 29986 7127
11 889880896479866881 27193 4837
12 889665388333682689 47038 9761
13 889638837579907072 26530 4400
14 889531135344209921 14777 2187
15 889278841981685760 24668 5214
16 888917238123831296 28474 4379
17 888804989199671297 24995 4165
18 888554962724278272 19385 3443
19 888078434458587136 21270 3395
In [33]:
twitter_archive.shape[0] - tweet_jsons.shape[0]
Out[33]:
16
Quality Issues in Additional Twitter Data
  • Missing information for 16 tweets in Twitter Archive Data: Twitter returned a "No status found with that ID" message for them.
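
The difference in row counts only equals the number of missing tweets if every ID in the smaller table is unique and also present in the archive. A hedged sketch of how the missing IDs themselves could be listed follows; the two series are toy stand-ins, and the ID marked as missing here is chosen arbitrarily for the example.

```python
import pandas as pd

# Toy stand-ins for twitter_archive.tweet_id and tweet_jsons.tweet_id.
archive_ids = pd.Series([892420643555336193,
                         892177421306343426,
                         891815181378084864], name='tweet_id')
api_ids = pd.Series([892420643555336193,
                     891815181378084864], name='tweet_id')

# IDs that are in the archive but were not returned by the API.
missing = archive_ids[~archive_ids.isin(api_ids)]
print(missing.tolist())
```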

Data Cleaning

Define

The following steps need to be taken to clean and combine the data for further analysis.

  1. Identify and exclude the rows in twitter_archive dataframe that correspond to retweets and replies.
  2. Exclude in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns.
  3. Convert values in timestamp column to datetime format.
  4. Clean HTML information in source column and convert it to category type.
  5. Remove duplicated URLs from expanded_urls column.
  6. Replace None values in the dog stage columns and name column with pandas NaN values.
  7. Check if any names can be extracted from tweets with non-name words in name column and add the proper names, if any.
  8. Replace other "non-name" values in name column with NaN values.
  9. Combine pupper, puppo and doggo columns in one dog_stages column.
  10. Check dog_stages for correctness.
  11. Explore the rating numerators and denominators to define if the ratings can be corrected or should be excluded.
  12. Combine the cleaned rating_numerator and rating_denominator columns in one rating column in float format.

  13. Join twitter_archive dataframe with image_predictions and tweet_jsons dataframes on tweet_id/id columns, removing the rows whose tweet IDs are not present in all three dataframes.
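
The last step of the plan, keeping only the tweets present in all three tables, corresponds to a chain of inner joins. A minimal sketch on toy frames, where the frame names and values are placeholders rather than the project data:

```python
import pandas as pd

# Toy stand-ins for the three dataframes, sharing a tweet_id column.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['collie', 'pug']})
counts = pd.DataFrame({'tweet_id': [2, 3], 'favorite_count': [10, 20]})

# how='inner' drops every tweet_id that is absent from either side.
merged = (archive
          .merge(predictions, on='tweet_id', how='inner')
          .merge(counts, on='tweet_id', how='inner'))
print(merged)
```

Only tweet 2 survives, because it is the only ID found in all three frames.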


Code & Test

In [34]:
# copying the data for cleaning
archive_clean = twitter_archive.copy()

Since the other dataframes are not going to be modified, and assigning the merged result to the copy made above won't affect them, there is no need to duplicate them in memory.
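
A small sketch of the reasoning above, assuming toy frames named `original` and `other`: changes made to a copy leave the source untouched, and `merge` returns a new dataframe instead of altering its inputs.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'tweet_id': [1, 2], 'name': ['None', 'Phineas']})
other = pd.DataFrame({'tweet_id': [1, 2], 'favorite_count': [5, 7]})

# Cleaning happens on a copy, so `original` keeps its raw values.
working = original.copy()
working['name'] = working['name'].replace('None', np.nan)

# merge() builds a new dataframe; `other` itself is not changed.
working = working.merge(other, on='tweet_id')

print(original.name.tolist())
print(other.columns.tolist())
print(working.columns.tolist())
```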


  1. Identify and exclude the rows in twitter archive dataframe that correspond to retweets and replies.
In [35]:
mask = archive_clean.in_reply_to_status_id.isnull() & archive_clean.retweeted_status_id.isnull()

archive_clean = archive_clean.loc[mask, ]
archive_clean.shape
Out[35]:
(2097, 17)
In [36]:
# test
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2097 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     2097 non-null object
source                        2097 non-null object
text                          2097 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2094 non-null object
rating_numerator              2097 non-null int64
rating_denominator            2097 non-null int64
name                          2097 non-null object
doggo                         2097 non-null object
floofer                       2097 non-null object
pupper                        2097 non-null object
puppo                         2097 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 294.9+ KB

  2. Exclude 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp' columns.
In [37]:
archive_clean = archive_clean.dropna(axis = 1, how = 'all')
archive_clean.shape
Out[37]:
(2097, 12)
In [38]:
# test
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null object
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  2097 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB

  3. Convert values in 'timestamp' column to datetime format.
In [39]:
archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)
In [40]:
# test

assert archive_clean.timestamp.dtype == 'datetime64[ns]'
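
The exact dtype string checked above can depend on the pandas version: on newer releases, timestamps that carry an offset such as +0000 may be parsed as timezone-aware values, in which case a comparison with 'datetime64[ns]' would fail. A more tolerant check, sketched here on a one-element toy series, could look like this:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# One timestamp in the same format as the archive data.
stamps = pd.to_datetime(pd.Series(['2017-08-01 16:23:56 +0000']))

# True for both timezone-naive and timezone-aware datetime columns.
assert is_datetime64_any_dtype(stamps)
print(stamps.dtype)
```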

  4. Clean HTML information in 'source' column and convert it to category type.
In [41]:
archive_clean.source = archive_clean.source.replace(r'^<a.*?>', '', regex = True)
archive_clean.source = archive_clean.source.replace('</a>', '', regex = True)
archive_clean.source.sample(3)  
Out[41]:
1070    Twitter for iPhone
1716    Twitter for iPhone
1087    Twitter for iPhone
Name: source, dtype: object
In [42]:
archive_clean.source.value_counts()
Out[42]:
Twitter for iPhone     1964
Vine - Make a Scene    91  
Twitter Web Client     31  
TweetDeck              11  
Name: source, dtype: int64
In [43]:
archive_clean.source = archive_clean.source.astype('category')
In [44]:
# test

assert archive_clean.source.dtype == 'category'

  5. Remove duplicated URLs from 'expanded_urls' column.
In [45]:
archive_clean[archive_clean.expanded_urls.notnull()].expanded_urls.head()
Out[45]:
0    https://twitter.com/dog_rates/status/892420643555336193/photo/1                                                                
1    https://twitter.com/dog_rates/status/892177421306343426/photo/1                                                                
2    https://twitter.com/dog_rates/status/891815181378084864/photo/1                                                                
3    https://twitter.com/dog_rates/status/891689557279858688/photo/1                                                                
4    https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1
Name: expanded_urls, dtype: object
In [46]:
archive_clean.expanded_urls = archive_clean.apply(lambda x: 
                                                  ', '.join(set(x['expanded_urls'].split(','))) 
                                                  if pd.notnull(x['expanded_urls']) else x['expanded_urls'], 
                                                  axis = 1)
In [47]:
# test

archive_clean[archive_clean.expanded_urls.notnull()].expanded_urls.head()
Out[47]:
0    https://twitter.com/dog_rates/status/892420643555336193/photo/1
1    https://twitter.com/dog_rates/status/892177421306343426/photo/1
2    https://twitter.com/dog_rates/status/891815181378084864/photo/1
3    https://twitter.com/dog_rates/status/891689557279858688/photo/1
4    https://twitter.com/dog_rates/status/891327558926688256/photo/1
Name: expanded_urls, dtype: object
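
One caveat with the set-based approach above: a set has no guaranteed order, so for tweets with several distinct URLs the joined string may come out in a different order from the original. An order-preserving variant could be sketched as follows; `dedupe_urls` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def dedupe_urls(urls):
    """Drop repeated URLs from a comma-separated string, keeping first-seen order."""
    if pd.isnull(urls):
        return urls
    seen = []
    for url in urls.split(','):
        url = url.strip()
        if url not in seen:
            seen.append(url)
    return ', '.join(seen)

example = pd.Series([
    'https://twitter.com/dog_rates/status/891327558926688256/photo/1,'
    'https://twitter.com/dog_rates/status/891327558926688256/photo/1',
    None,
])
print(example.apply(dedupe_urls).tolist())
```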

  6. Replace "None" values in dog stages columns and 'name' column with NaN values.
In [48]:
archive_clean.iloc[: , -5:] = archive_clean.iloc[: , -5:].replace('None', np.nan)
In [49]:
# test

archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null category
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1494 non-null object
doggo                 83 non-null object
floofer               10 non-null object
pupper                230 non-null object
puppo                 24 non-null object
dtypes: category(1), datetime64[ns](1), int64(3), object(7)
memory usage: 198.8+ KB

  7. Check if any names can be extracted from tweets with non-name words in 'name' column and add the proper names, if any.

  8. Replace other non-name words in 'name' column with NaN.

In [50]:
archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name'] 
                                                  if x['name'][0].islower() else "Names", 
                                                  axis = 1).value_counts()
Out[50]:
Names           1390
a               55  
the             8   
an              6   
very            4   
one             4   
quite           3   
just            3   
actually        2   
not             2   
getting         2   
my              1   
officially      1   
old             1   
infuriating     1   
light           1   
all             1   
unacceptable    1   
this            1   
space           1   
mad             1   
life            1   
by              1   
such            1   
his             1   
incredibly      1   
dtype: int64
In [51]:
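# collect the lowercase "non-name" words; [1:] skips the "Names" placeholder,
# which comes first because it has the highest count
not_names = archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name'] 
                                                  if x['name'][0].islower() else "Names", 
                                                  axis = 1).value_counts().index.tolist()[1:]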
", ".join(not_names)
Out[51]:
'a, the, an, very, one, quite, just, actually, not, getting, my, officially, old, infuriating, light, all, unacceptable, this, space, mad, life, by, such, his, incredibly'
In [52]:
for index, row in archive_clean.iterrows():
    if row['name'] in not_names:
        print(index, row['text'])
22 I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba
56 Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af 
(IG: puffie_the_chow) https://t.co/ghXBIIeQZF
169 We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10 https://t.co/g2nSyGenG9
193 Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective https://t.co/Xc7uj1C64x
335 There's going to be a dog terminal at JFK Airport. This is not a drill. 10/10  
https://t.co/dp5h9bCwU7
369 Occasionally, we're sent fantastic stories. This is one of them. 14/10 for Grace https://t.co/bZ4axuH6OK
542 We only rate dogs. Please stop sending in non-canines like this Freudian Poof Lion. This is incredibly frustrating... 11/10 https://t.co/IZidSrBvhi
649 Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq
801 Guys this is getting so out of hand. We only rate dogs. This is a Galapagos Speed Panda. Pls only send dogs... 10/10 https://t.co/8lpAGaZRFn
819 We only rate dogs. Pls stop sending in non-canines like this Arctic Floof Kangaroo. This is very frustrating. 11/10 https://t.co/qlUDuPoE3d
852 This is my dog. Her name is Zoey. She knows I've been rating other dogs. She's not happy. 13/10 no bias at all https://t.co/ep1NkYoiwB
924 This is one of the most inspirational stories I've ever come across. I have no words. 14/10 for both doggo and owner https://t.co/I5ld3eKD5k
988 What jokester sent in a pic without a dog in it? This is not @rock_rates. This is @dog_rates. Thank you ...10/10 https://t.co/nDPaYHrtNX
992 That is Quizno. This is his beach. He does not tolerate human shenanigans on his beach. 10/10 reclaim ur land doggo https://t.co/vdr7DaRSa7
993 This is one of the most reckless puppers I've ever seen. How she got a license in the first place is beyond me. 6/10 https://t.co/z5bAdtn9kd
1002 This is a mighty rare blue-tailed hammer sherk. Human almost lost a limb trying to take these. Be careful guys. 8/10 https://t.co/TGenMeXreW
1004 Viewer discretion is advised. This is a terrible attack in progress. Not even in water (tragic af). 4/10 bad sherk https://t.co/L3U0j14N5R
1017 This is a carrot. We only rate dogs. Please only send in dogs. You all really should know this by now ...11/10 https://t.co/9e48aPrBm2
1025 This is an Iraqi Speed Kangaroo. It is not a dog. Please only send in dogs. I'm very angry with all of you ...9/10 https://t.co/5qpBTTpgUt
1031 We only rate dogs. Pls stop sending in non-canines like this Jamaican Flop Seal. This is very very frustrating. 9/10 https://t.co/nc53zEN0hZ
1040 This is actually a pupper and I'd pet it so well. 12/10
https://t.co/RNqS7C4Y4N
1049 This is a very rare Great Alaskan Bush Pupper. Hard to stumble upon without spooking. 12/10 would pet passionately https://t.co/xOBKCdpzaa
1063 This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC
1071 This is getting incredibly frustrating. This is a Mexican Golden Beaver. We only rate dogs. Only send dogs ...10/10 https://t.co/0yolOOyD3X
1095 Say hello to mad pupper. You know what you did. 13/10 would pet until no longer furustrated https://t.co/u1ulQ5heLX
1097 We only rate dogs. Please stop sending in non-canines like this Alaskan Flop Turtle. This is very frustrating. 10/10 https://t.co/qXteK6Atxc
1120 Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
1121 We only rate dogs. Pls stop sending non-canines like this Bulgarian Eyeless Porch Bear. This is unacceptable... 9/10 https://t.co/2yctWAUZ3Z
1138 This is all I want in my life. 12/10 for super sleepy pupper https://t.co/4RlLA5ObMh
1193 People please. This is a Deadly Mediterranean Plop T-Rex. We only rate dogs. Only send in dogs. Thanks you... 11/10 https://t.co/2ATDsgHD4n
1206 This is old now but it's absolutely heckin fantastic and I can't not share it with you all. 13/10  https://t.co/wJX74TSgzP
1207 This is a taco. We only rate dogs. Please only send in dogs. Dogs are what we rate. Not tacos. Thank you... 10/10 https://t.co/cxl6xGY8B9
1259 We 👏🏻 only 👏🏻 rate 👏🏻 dogs. Pls stop sending in non-canines like this Dutch Panda Worm. This is infuriating. 11/10 https://t.co/odfLzBonG2
1340 Here is a heartbreaking scene of an incredible pupper being laid to rest. 10/10 RIP pupper https://t.co/81mvJ0rGRu
1351 Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
1361 This is a Butternut Cumberfloof. It's not windy they just look like that. 11/10 back at it again with the red socks https://t.co/hMjzhdUHaW
1362 This is an East African Chalupa Seal. We only rate dogs. Please only send in dogs. Thank you... 10/10 https://t.co/iHe6liLwWR
1368 This is a Wild Tuscan Poofwiggle. Careful not to startle. Rare tongue slip. One eye magical. 12/10 would def pet https://t.co/4EnShAQjv6
1382 "Pupper is a present to world. Here is a bow for pupper." 12/10 precious as hell https://t.co/ItSsE92gCW
1385 We only rate dogs. Pls stop sending in non-canines like this Mongolian grass snake. This is very frustrating. 11/10 https://t.co/22x9SbCYCU
1435 Please stop sending in saber-toothed tigers. This is getting ridiculous. We only rate dogs.
...8/10 https://t.co/iAeQNueou8
1457 This is just a beautiful pupper good shit evolution. 12/10 https://t.co/2L8pI0Z2Ib
1499 This is a rare Arctic Wubberfloof. Unamused by the happenings. No longer has the appetites. 12/10 would totally hug https://t.co/krvbacIX0N
1527 Stop sending in lobsters. This is the final warning. We only rate dogs. Thank you... 9/10 https://t.co/B9ZXXKJYNx
1603 This is the newly formed pupper a capella group. They're just starting out but I see tons of potential. 8/10 for all https://t.co/wbAcvFoNtn
1693 This is actually a lion. We only rate dogs. For the last time please only send dogs. Thank u.
12/10 would still pet https://t.co/Pp26dMQxap
1724 This is by far the most coordinated series of pictures I was sent. Downright impressive in every way. 12/10 for all https://t.co/etzLo3sdZE
1737 Guys this really needs to stop. We've been over this way too many times. This is a giraffe. We only rate dogs.. 7/10 https://t.co/yavgkHYPOC
1747 This is officially the greatest yawn of all time. 12/10 https://t.co/4R0Cc0sLVE
1785 This is a dog swinging. I really enjoyed it so I hope you all do as well. 11/10 https://t.co/Ozo9KHTRND
1797 This is the happiest pupper I've ever seen. 10/10 would trade lives with https://t.co/ep8ATEJwRb
1815 This is the saddest/sweetest/best picture I've been sent. 12/10 😢🐶 https://t.co/vQ2Lw1BLBF
1853 This is a Sizzlin Menorah spaniel from Brooklyn named Wylie. Lovable eyes. Chiller as hell. 10/10 and I'm out.. poof https://t.co/7E0AiJXPmI
1854 Seriously guys?! Only send in dogs. I only rate dogs. This is a baby black bear... 11/10 https://t.co/H7kpabTfLj
1877 C'mon guys. We've been over this. We only rate dogs. This is a cow. Please only submit dogs. Thank you...... 9/10 https://t.co/WjcELNEqN2
1878 This is a fluffy albino Bacardi Columbia mix. Excellent at the tweets. 11/10 would hug gently https://t.co/diboDRUuEI
1916 This is life-changing. 12/10 https://t.co/SroTpI6psB
1923 This is a Sagitariot Baklava mix. Loves her new hat. 11/10 radiant pup https://t.co/Bko5kFJYUU
1936 This is one esteemed pupper. Just graduated college. 10/10 what a champ https://t.co/nyReCVRiyd
1941 This is a heavily opinionated dog. Loves walls. Nobody knows how the hair works. Always ready for a kiss. 4/10 https://t.co/dFiaKZ9cDl
1955 This is a Lofted Aphrodisiac Terrier named Kip. Big fan of bed n breakfasts. Fits perfectly. 10/10 would pet firmly https://t.co/gKlLpNzIl3
1994 This is a baby Rand Paul. Curls for days. 11/10 would cuddle the hell out of https://t.co/xHXNaPAYRe
2001 This is light saber pup. Ready to fight off evil with light saber. 10/10 true hero https://t.co/LPPa3btIIt
2019 This is just impressive I have nothing else to say. 11/10 https://t.co/LquQZiZjJP
2030 This is space pup. He's very confused. Tries to moonwalk at one point. Super spiffy uniform. 13/10 I love space pup https://t.co/SfPQ2KeLdq
2034 This is a Tuscaloosa Alcatraz named Jacob (Yacōb). Loves to sit in swing. Stellar tongue. 11/10 look at his feet https://t.co/2IslQ8ZSc7
2037 This is the best thing I've ever seen so spread it like wildfire &amp; maybe we'll find the genius who created it. 13/10 https://t.co/q6RsuOVYwU
2066 This is a Helvetica Listerine named Rufus. This time Rufus will be ready for the UPS guy. He'll never expect it 9/10 https://t.co/34OhVhMkVr
2116 This is a Deciduous Trimester mix named Spork. Only 1 ear works. No seat belt. Incredibly reckless. 9/10 still cute https://t.co/CtuJoLHiDo
2125 This is a Rich Mahogany Seltzer named Cherokee. Just got destroyed by a snowball. Isn't very happy about it. 9/10 https://t.co/98ZBi6o4dj
2128 This is a Speckled Cauliflower Yosemite named Hemry. He's terrified of intruder dog. Not one bit comfortable. 9/10 https://t.co/yV3Qgjh8iN
2146 This is a spotted Lipitor Rumpelstiltskin named Alphred. He can't wait for the Turkey. 10/10 would pet really well https://t.co/6GUGO7azNX
2153 This is a brave dog. Excellent free climber. Trying to get closer to God. Not very loyal though. Doesn't bark. 5/10 https://t.co/ODnILTr4QM
2161 This is a Coriander Baton Rouge named Alfredo. Loves to cuddle with smaller well-dressed dog. 10/10 would hug lots https://t.co/eCRdwouKCl
2191 This is a Slovakian Helter Skelter Feta named Leroi. Likes to skip on roofs. Good traction. Much balance. 10/10 wow! https://t.co/Dmy2mY2Qj5
2198 This is a wild Toblerone from Papua New Guinea. Mouth always open. Addicted to hay. Acts blind. 7/10 handsome dog https://t.co/IGmVbz07tZ
2204 This is an Irish Rigatoni terrier named Berta. Completely made of rope. No eyes. Quite large. Loves to dance. 10/10 https://t.co/EM5fDykrJg
2211 Here is a horned dog. Much grace. Can jump over moons (dam!). Paws not soft. Bad at barking. 7/10 can still pet tho https://t.co/2Su7gmsnZm
2212 Never forget this vine. You will not stop watching for at least 15 minutes. This is the second coveted.. 13/10 https://t.co/roqIxCvEB3
2218 This is a Birmingham Quagmire named Chuk. Loves to relax and watch the game while sippin on that iced mocha. 10/10 https://t.co/HvNg9JWxFt
2222 Here is a mother dog caring for her pups. Snazzy red mohawk. Doesn't wag tail. Pups look confused. Overall 4/10 https://t.co/YOHe6lf09m
2235 This is a Trans Siberian Kellogg named Alfonso. Huge ass eyeballs. Actually Dobby from Harry Potter. 7/10 https://t.co/XpseHBlAAb
2249 This is a Shotokon Macadamia mix named Cheryl. Sophisticated af. Looks like a disappointed librarian. Shh (lol) 9/10 https://t.co/J4GnJ5Swba
2255 This is a rare Hungarian Pinot named Jessiga. She is either mid-stroke or got stuck in the washing machine. 8/10 https://t.co/ZU0i0KJyqD
2264 This is a southwest Coriander named Klint. Hat looks expensive. Still on house arrest :(
9/10 https://t.co/IQTOMqDUIe
2273 This is a northern Wahoo named Kohl. He runs this town. Chases tumbleweeds. Draws gun wicked fast. 11/10 legendary https://t.co/J4vn2rOYFk
2287 This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW
2304 This is a curly Ticonderoga named Pepe. No feet. Loves to jet ski. 11/10 would hug until forever https://t.co/cyDfaK8NBc
2311 This is a purebred Bacardi named Octaviath. Can shoot spaghetti out of mouth. 10/10 https://t.co/uEvsGLOFHa
2314 This is a golden Buckminsterfullerene named Johm. Drives trucks. Lumberjack (?). Enjoys wall. 8/10 would hug softly https://t.co/uQbZJM2DQB
2326 This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t
2327 This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat). 7/10 https://t.co/LopTBkKa8h
2333 This is an extremely rare horned Parthenon. Not amused. Wears shoes. Overall very nice. 9/10 would pet aggressively https://t.co/QpRjllzWAL
2334 This is a funny dog. Weird toes. Won't come down. Loves branch. Refuses to eat his food. Hard to cuddle with. 3/10 https://t.co/IIXis0zta0
2335 This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
2345 This is the happiest dog you will ever see. Very committed owner. Nice couch. 10/10 https://t.co/RhUEAloehK
2346 Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p
2347 My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O
2348 Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt
2349 This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc
2350 This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe
2352 This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx
2353 Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR
2354 This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI

In some tweets where "non-name" words were extracted, the actual name appears after the words "named" or "name is". These names can be extracted and added to the name column. Other values should be replaced with NaN.

In [53]:
def get_name(x, text):
    """
    Extract a dog name from the text of a tweet if a non-name
    (lowercase) word was extracted on the previous iteration.
    """
    # keep missing values and proper (capitalised) names as they are
    if pd.isnull(x) or x[0].isupper():
        return x

    # look for the name right after "named " or "name is "
    for split_word in ['named ', 'name is ']:
        if split_word in text:
            return text.split(split_word)[1].split(' ')[0].replace('.', '')

    # no name found in the text
    return np.nan

# Function test 

print(get_name(archive_clean.name[1], archive_clean.text[1]), # Name
      get_name(archive_clean.name[1878], archive_clean.text[1878]), # No name in text
      get_name(archive_clean.name[2235], archive_clean.text[2235])) # Article instead of name
Tilly nan Alfonso
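
As an alternative to splitting the text by hand, the same idea can be expressed with a regular expression. This is only a sketch under one extra assumption that the function above does not make: a name must start with a capital letter. The sample texts are shortened from tweets printed earlier.

```python
import pandas as pd

# Capture the capitalised word that follows "named" or "name is".
NAME_PATTERN = r'(?:named|name is)\s+([A-Z]\w*)'

texts = pd.Series([
    'This is a Lofted Aphrodisiac Terrier named Kip. Big fan of bed n breakfasts.',
    'This is a Dasani Kingfisher from Maine. His name is Daryl.',
    'This is a fluffy albino Bacardi Columbia mix. Excellent at the tweets.',
])

# expand=False returns a Series; rows without a match become NaN.
extracted = texts.str.extract(NAME_PATTERN, expand=False)
print(extracted.tolist())
```

Because the capture group stops at the first non-word character, trailing punctuation such as the full stop after "Kip" is dropped without a separate replace step.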
In [54]:
archive_clean.name = archive_clean.apply(lambda x: get_name(x['name'], x['text']), 
                                                  axis = 1)
In [55]:
archive_clean.name.value_counts()
Out[55]:
Lucy         11
Charlie      11
Cooper       10
Oliver       10
Tucker       9 
Penny        9 
Winston      8 
Lola         8 
Sadie        8 
Toby         7 
Daisy        7 
Stanley      6 
Oscar        6 
Bo           6 
Bailey       6 
Jax          6 
Koda         6 
Bella        6 
Leo          5 
Scout        5 
Chester      5 
Buddy        5 
Dave         5 
Louis        5 
Milo         5 
Bentley      5 
Rusty        5 
Archie       4 
Gus          4 
Winnie       4 
            .. 
Chloe        1 
Milky        1 
Shaggy       1 
Hercules     1 
Darby        1 
Skittle      1 
Brady        1 
Peanut       1 
Flash        1 
Harnold      1 
Geoff        1 
Ed           1 
Vixen        1 
Derby        1 
Charleson    1 
Rorie        1 
Jerome       1 
Rodney       1 
Champ        1 
Shiloh       1 
Ebby         1 
Kulet        1 
Iggy         1 
Marlee       1 
Rooney       1 
Covach       1 
Blue         1 
Obie         1 
Burt         1 
Edmund       1 
Name: name, Length: 947, dtype: int64
In [56]:
# test

assert len(archive_clean[archive_clean.name.notnull()].apply(lambda x: x['name'] 
                                                  if x['name'][0].islower() else "Names", 
                                                  axis = 1).value_counts().index.tolist()) == 1

  9. Combine 'pupper', 'puppo' and 'doggo' columns in one 'dog_stages' column.
In [57]:
archive_clean[['pupper', 'puppo', 'doggo']] = archive_clean[['pupper', 'puppo', 'doggo']].fillna('')
archive_clean[['pupper', 'puppo', 'doggo']].sample(10)
Out[57]:
pupper puppo doggo
1589 pupper
1691
2102
816
502
636
94 puppo
400
1997
1094
In [58]:
archive_clean['dog_stages'] = archive_clean.pupper + ',' + archive_clean.puppo + ',' + archive_clean.doggo

archive_clean.dog_stages = archive_clean.dog_stages.replace(",,", np.nan)
archive_clean.iloc[: , -5:-1] = archive_clean.iloc[: , -5:-1].replace('', np.nan)
In [59]:
archive_clean.iloc[: , -5:].sample(5)
Out[59]:
doggo floofer pupper puppo dog_stages
542 NaN NaN NaN NaN NaN
259 NaN NaN NaN NaN NaN
29 NaN NaN pupper NaN pupper,,
802 NaN NaN pupper NaN pupper,,
412 NaN NaN NaN NaN NaN
In [62]:
archive_clean.dog_stages = archive_clean.dog_stages.str.strip(",").replace(',,', ',',  regex = True)
In [63]:
archive_clean.dog_stages.value_counts()
Out[63]:
pupper          221
doggo           73 
puppo           23 
pupper,doggo    9  
puppo,doggo     1  
Name: dog_stages, dtype: int64
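
The same combination can also be done without the intermediate empty strings and comma clean-up, by joining only the non-missing values in each row. A hedged sketch on a toy frame, where `join_stages` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three stage columns after "None" was replaced with NaN.
stages = pd.DataFrame({
    'pupper': ['pupper', np.nan, 'pupper', np.nan],
    'puppo':  [np.nan, 'puppo', np.nan, np.nan],
    'doggo':  [np.nan, 'doggo', 'doggo', np.nan],
})

def join_stages(row):
    """Join the stages present in a row; return NaN if there are none."""
    found = row.dropna().tolist()
    return ','.join(found) if found else np.nan

stages['dog_stages'] = stages[['pupper', 'puppo', 'doggo']].apply(join_stages, axis=1)
print(stages.dog_stages.tolist())
```

A row-wise apply is slower than string concatenation on large tables, but for a couple of thousand tweets the difference should be negligible.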

  10. Check 'dog_stages' for correctness.
In [63]:
mask = archive_clean.dog_stages == 'puppo,doggo'

archive_clean[mask].text
Out[63]:
191    Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel
Name: text, dtype: object

As can be seen from the text, the stage should be set to 'puppo'.

In [64]:
pd.options.mode.chained_assignment = None

archive_clean.loc[191, 'dog_stages'] = 'puppo'

archive_clean.dog_stages.value_counts()
Out[64]:
pupper          221
doggo           73 
puppo           24 
pupper,doggo    9  
Name: dog_stages, dtype: int64
In [65]:
mask = archive_clean.dog_stages == 'pupper,doggo'

archive_clean[mask].text
Out[65]:
460     This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7
531     Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho                    
575     This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj                    
705     This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd
733     Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u                                                                                                          
889     Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll                    
956     Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8                            
1063    This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC                                                                         
1113    Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda                                                                                          
Name: text, dtype: object

For the indexes above:
460 - no stage
531 - two dogs
575 - pupper
705 - doggo in text, but actually a hedgehog
733 - two dogs
889 - two dogs
956 - doggo in picture
1063 - two dogs
1113 - two dogs

In [66]:
archive_clean.loc[460, 'dog_stages'] = np.nan
archive_clean.loc[575, 'dog_stages'] = 'pupper'
archive_clean.loc[705, 'dog_stages'] = np.nan
archive_clean.loc[956, 'dog_stages'] = 'doggo'
In [67]:
archive_clean.dog_stages.value_counts()
Out[67]:
pupper          222
doggo           74 
puppo           24 
pupper,doggo    5  
Name: dog_stages, dtype: int64
In [68]:
mask = archive_clean.dog_stages == 'doggo'

archive_clean[mask].text
Out[68]:
9       This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A         
43      Meet Yogi. He doesn't have any important dog meetings today he just enjoys looking his best at all times. 12/10 for dangerously dapper doggo https://t.co/YSI00BzTBZ  
99      Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR          
108     This is Napolean. He's a Raggedy East Nicaraguan Zoom Zoom. Runs on one leg. Built for deception. No eyes. Good with kids. 12/10 great doggo https://t.co/PR7B7w1rUw  
110     Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH                                                                                                                     
121     This is Scout. He just graduated. Officially a doggo now. Have fun with taxes and losing sight of your ambitions. 12/10 would throw cap for https://t.co/DsA2hwXAJo   
172     I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq    
200     At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk  
240     This is Barney. He's an elder doggo. Hitches a ride when he gets tired. Waves goodbye before he leaves. 13/10 please come back soon https://t.co/cFAasDXauK           
248     Say hello to Mimosa. She's an emotional support doggo who helps her owner with PTSD. 13/10, but she needs your help\n\nhttps://t.co/L6mLzrd7Mx https://t.co/jMutBFdw5o
300     This is Meera. She just heard about taxes and how much a doghouse in a nice area costs. Not pupared to be a  doggo anymore. 12/10 https://t.co/GZmNEdyoJY             
318     Here's a doggo fully pupared for a shower. H*ckin exquisite balance. Sneaky tongue slip too. 13/10 https://t.co/UtEVnQ1ZPg                                            
323     DOGGO ON THE LOOSE I REPEAT DOGGO ON THE LOOSE 10/10 https://t.co/ffIH2WxwF0                                                                                          
331     This is Rhino. He arrived at a shelter with an elaborate doggo manual for his new family, written by someone who will always love him. 13/10 https://t.co/QX1h0oqMz0  
339     Say hello to Smiley. He's a blind therapy doggo having a h*ckin blast high steppin around in the snow. 14/10 would follow anywhere https://t.co/SHAb1wHjMz            
344     This is Miguel. He was the only remaining doggo at the adoption center after the weekend. Let's change that. 12/10\n\nhttps://t.co/P0bO8mCQwN https://t.co/SU4K34NT4M 
345     This is Emanuel. He's a h*ckin rare doggo. Dwells in a semi-urban environment. Round features make him extra collectible. 12/10 would so pet https://t.co/k9bzgyVdUT  
351     This is Pete. He has no eyes. Needs a guide doggo. Also appears to be considerably fluffy af. 12/10 would hug softly https://t.co/Xc0gyovCtK                          
362     Here's a stressed doggo. Had a long day. Many things on her mind. The hat communicates these feelings exquisitely. 11/10 https://t.co/fmRS43mWQB                      
363     This is Astrid. She's a guide doggo in training. 13/10 would follow anywhere https://t.co/xo7FZFIAao                                                                  
372     Meet Doobert. He's a deaf doggo. Didn't stop him on the field tho. Absolute legend today. 14/10 would pat head approvingly https://t.co/iCk7zstRA9                    
384     This is Loki. He smiles like Elvis. Ain't nothin but a hound doggo. 12/10 https://t.co/QV5nx6otZR                                                                     
385     This is Cupid. He was found in the trash. Now he's well on his way to prosthetic front legs and a long happy doggo life. 13/10 heroic af https://t.co/WS0Gha8vRh      
389     This is Pilot. He has mastered the synchronized head tilt and sneaky tongue slip. Usually not unlocked until later doggo days. 12/10 https://t.co/YIV8sw8xkh          
391     Here's a little more info on Dew, your favorite roaming doggo that went h*ckin viral. 13/10 \nhttps://t.co/1httNYrCeW https://t.co/KvaM8j3jhX                         
423     This is Duchess. She uses dark doggo forces to levitate her toys. 13/10 magical af https://t.co/maDNMETA52                                                            
426     This is Sundance. He's a doggo drummer. Even sings a bit on the side. 14/10 entertained af (vid by @sweetsundance) https://t.co/Xn5AQtiqzG                            
429     Here's a doggo who looks like he's about to give you a list of mythical ingredients to go collect for his potion. 11/10 would obey https://t.co/8SiwKDlRcl            
440     Here we have a doggo who has messed up. He was hoping you wouldn't notice. 11/10 someone help him https://t.co/XdRNXNYD4E                                             
448     This is Sunny. She was also a very good First Doggo. 14/10 would also be an absolute honor to pet https://t.co/YOC1fHFCSb                                             
                                                                  ...                                                                                                         
780     This is Anakin. He strives to reach his full doggo potential. Born with blurry tail tho. 11/10 would still pet well https://t.co/9CcBSxCXXG                           
782     This is Finley. He's an independent doggo still adjusting to life on his own. 11/10 https://t.co/7FNcBaKbci                                                           
807     Doggo will persevere. 13/10\nhttps://t.co/yOVzAomJ6k                                                                                                                  
835     Meet Gerald. He's a fairly exotic doggo. Floofy af. Inadequate knees tho. Self conscious about large forehead. 8/10 https://t.co/WmczvjCWJq                           
839     I don't know any of the backstory behind this picture but for some reason I'm crying. 13/10 for owner and doggo https://t.co/QOKZdus9TT                               
877     This is Wishes. He has the day off. Daily struggles of being a doggo have finally caught up with him. 11/10 https://t.co/H9YgrUkYwa                                   
881     Doggo want what doggo cannot have. Temptation strong, dog stronger. 12/10  https://t.co/IqyTF6qik6                                                                    
899     This doggo is just waiting for someone to be proud of her and her accomplishment. 13/10 legendary af https://t.co/9T2h14yn4Q                                          
914     Here's a doggo completely oblivious to the double rainbow behind him. 10/10 someone tell him https://t.co/OfvRoD6ndV                                                  
919     All hail sky doggo. 13/10 would jump super high to pet https://t.co/CsLRpqdeTF                                                                                        
924     This is one of the most inspirational stories I've ever come across. I have no words. 14/10 for both doggo and owner https://t.co/I5ld3eKD5k                          
944     Nothing better than a doggo and a sunset. 10/10 majestic af https://t.co/xVSodF19PS                                                                                   
945     Hooman used Pokeball\n*wiggle*\n*wiggle*\nDoggo broke free \n10/10 https://t.co/bWSgqnwSHr                                                                            
948     Here's a doggo trying to catch some fish. 8/10 futile af (vid by @KellyBauerx) https://t.co/jwd0j6oWLE                                                                
956     Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8                              
977     Meet Piper. She's an airport doggo. Please return your tray table to its full pupright and locked position. 11/10 https://t.co/D17IAcetmM                             
985     This is Boomer. He's self-baptizing. Other doggo not ready to renounce sins. 11/10 spiritually awakened af https://t.co/cRTJiQQk9o                                    
989     Say hello to Divine Doggo. Must be magical af. 13/10 would be an honor to pet https://t.co/BbcABzohKb                                                                 
992     That is Quizno. This is his beach. He does not tolerate human shenanigans on his beach. 10/10 reclaim ur land doggo https://t.co/vdr7DaRSa7                           
1030    This is Lenox. She's in a wheelbarrow. Silly doggo. You don't belong there. 10/10 would push around https://t.co/oYbVR4nBsR                                           
1039    Here's a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad) https://t.co/7wE9LTEXC4                                                   
1051    For anyone who's wondering, this is what happens after a doggo catches it's tail... 11/10 https://t.co/G4fNhzelDv                                                     
1075    Here's a doggo that don't need no human. 12/10 independent af (vid by @MichelleLiuCee) https://t.co/vdgtdb6rON                                                        
1079    Here's a doggo blowing bubbles. It's downright legendary. 13/10 would watch on repeat forever (vid by Kent Duryee) https://t.co/YcXgHfp1EC                            
1103    This is Kellogg. He accidentally opened the front facing camera. 8/10 get it together doggo https://t.co/MRYv7nDPyS                                                   
1117    This is Kyle (pronounced 'Mitch'). He strives to be the best doggo he can be. 11/10 would pat on head approvingly https://t.co/aA2GiTGvlE                             
1141    Here's a doggo struggling to cope with the winds. 13/10 https://t.co/qv3aUwaouT                                                                                       
1156    Nothin better than a doggo and a sunset. 11/10 https://t.co/JlFqOhrHEs                                                                                                
1176    This doggo was initially thrilled when she saw the happy cartoon pup but quickly realized she'd been deceived. 10/10 https://t.co/mvnBGaWULV                          
1204    Here's a super majestic doggo and a sunset 11/10 https://t.co/UACnoyi8zu                                                                                              
Name: text, Length: 74, dtype: object

Of the following tweets:
363 This is Astrid. She's a guide doggo in training. 13/10 would follow anywhere https://t.co/xo7FZFIAao
389 This is Pilot. He has mastered the synchronized head tilt and sneaky tongue slip. Usually not unlocked until later doggo days. 12/10 https://t.co/YIV8sw8xkh
992 That is Quizno. This is his beach. He does not tolerate human shenanigans on his beach. 10/10 reclaim ur land doggo https://t.co/vdr7DaRSa7

363 is a pupper, 389 is a puppo and 992 is a horse.

In [69]:
# .loc avoids chained assignment, which may write to a temporary copy
archive_clean.loc[363, 'dog_stages'] = 'pupper'
archive_clean.loc[389, 'dog_stages'] = 'puppo'
archive_clean.loc[992, 'dog_stages'] = np.nan
In [70]:
archive_clean.dog_stages.value_counts()
Out[70]:
pupper          223
doggo           71 
puppo           25 
pupper,doggo    5  
Name: dog_stages, dtype: int64
In [71]:
mask = archive_clean.dog_stages == 'puppo'

archive_clean[mask].text
Out[71]:
12      Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm                                    
14      This is Stuart. He's sporting his favorite fanny pack. Secretly filled with bones only. 13/10 puppared puppo #BarkWeek https://t.co/y70o6h3isq                        
71      This is Snoopy. He's a proud #PrideMonthPuppo. Impeccable handwriting for not having thumbs. 13/10 would love back #PrideMonth https://t.co/lNZwgNO4gS                
94      This is Sebastian. He can't see all the colors of the rainbow, but he can see that this flag makes his human happy. 13/10 #PrideMonth puppo https://t.co/XBE0evJZ6V   
129     This is Shikha. She just watched you drop a skittle on the ground and still eat it. Could not be less impressed. 12/10 superior puppo https://t.co/XZlZKd73go         
168     Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's 13/10 https://t.co/BArWupFAn0      
191     Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel        
389     This is Pilot. He has mastered the synchronized head tilt and sneaky tongue slip. Usually not unlocked until later doggo days. 12/10 https://t.co/YIV8sw8xkh          
395     Here's a very loving and accepting puppo. Appears to have read her Constitution well. 14/10 would pat head approvingly https://t.co/6ao80wIpV1                        
398     Say hello to Pablo. He's one gorgeous puppo. A true 12/10. Click the link to see why Pablo requests your assistance\n\nhttps://t.co/koHvVQp9bL https://t.co/IhW0JKf7kc
413     Here's a super supportive puppo participating in the Toronto  #WomensMarch today. 13/10 https://t.co/nTz3FtorBc                                                       
439     This is Oliver. He has dreams of being a service puppo so he can help his owner. 13/10 selfless af\n\nmake it happen:\nhttps://t.co/f5WMsx0a9K https://t.co/6lJz0DKZIb
554     This is Diogi. He fell in the pool as soon as he was brought home. Clumsy puppo. 12/10 would pet until dry https://t.co/ZxeRjMKaWt                                    
567     This is Loki. He'll do your taxes for you. Can also make room in your budget for all the things you bought today. 12/10 what a puppo https://t.co/5oWrHCWg87          
643     Say hello to Lily. She's pupset that her costume doesn't fit as well as last year. 12/10 poor puppo https://t.co/YSi6K1firY                                           
663     This is Betty. She's assisting with the dishes. Such a good puppo. 12/10 h*ckin helpful af https://t.co/dgvTPZ9tgI                                                    
689     This is Tonks. She is a service puppo. Can hear a caterpillar hiccup from 7 miles away. 13/10 would follow anywhere https://t.co/i622ZbWkUp                           
713     This is Reginald. He's one magical puppo. Aerodynamic af. 12/10 would catch https://t.co/t0cEeRbcXJ                                                                   
736     I want to finally rate this iconic puppo who thinks the parade is all for him. 13/10 would absolutely attend https://t.co/5dUYOu4b8d                                  
922     When ur older siblings get to play in the deep end but dad says ur not old enough. Maybe one day puppo. All 10/10 https://t.co/JrDAzMhwG9                             
947     Hopefully this puppo on a swing will help get you through your Monday. 11/10 would push https://t.co/G54yClasz2                                                       
961     This is Cooper. He's just so damn happy. 10/10 what's your secret puppo? https://t.co/yToDwVXEpA                                                                      
1035    This is Abby. She got her face stuck in a glass. Churlish af. 9/10 rookie move puppo https://t.co/2FPb45NXrK                                                          
1048    This is Kilo. He cannot reach the snackum. Nifty tongue, but not nifty enough. 10/10 maybe one day puppo https://t.co/gSmp31Zrsx                                      
1083    This is Bayley. She fell asleep trying to escape her evil fence enclosure. 11/10 night night puppo https://t.co/AxSiqAKEKu                                            
Name: text, dtype: object

In these tweets the word "puppo" seems to be meaningful.
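
A visual check like this can be backed by a quick programmatic one that lists rows whose text never mentions the stage they are labelled with. This is a minimal sketch on toy rows modelled on the tweets above (texts shortened); the function name stage_word_missing is hypothetical.

```python
import pandas as pd

# Toy rows modelled on the tweets above (texts shortened for the sketch).
sample = pd.DataFrame(
    {
        "text": [
            "Here's a puppo that seems to be on the fence about something",
            "This is Pilot. Usually not unlocked until later doggo days. 12/10",
            "This is Snoopy. He's a proud #PrideMonthPuppo.",
        ],
        "dog_stages": ["puppo", "puppo", "puppo"],
    },
    index=[12, 389, 71],
)


def stage_word_missing(df, stage):
    """Return rows labelled `stage` whose text never mentions that word."""
    labelled = df[df.dog_stages == stage]
    # Case-insensitive plain substring test (no regular expression).
    mentions = labelled.text.str.lower().str.contains(stage, regex=False)
    return labelled[~mentions]


print(stage_word_missing(sample, "puppo").index.tolist())  # [389]
```

Rows flagged this way are not necessarily wrong: a stage assigned by hand from the picture would show up here too.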

In [72]:
mask = archive_clean.dog_stages == 'pupper'

archive_clean[mask].text
Out[72]:
29      This is Roscoe. Another pupper fallen victim to spontaneous tongue ejections. Get the BlepiPen immediate. 12/10 deep breaths Roscoe https://t.co/RGE08MIJox           
49      This is Gus. He's quite the cheeky pupper. Already perfected the disinterested wink. 12/10 would let steal my girl https://t.co/D43I96SlVu                            
56      Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF           
82      This is Ginger. She's having a ruff Monday. Too many pupper things going on. H*ckin exhausting. 12/10 would snug passionately https://t.co/j211oCDRs6                 
92      This is Jed. He may be the fanciest pupper in the game right now. Knows it too. 13/10 would sign modeling contract https://t.co/0YplNnSMEm                            
98      This is Sierra. She's one precious pupper. Absolute 12/10. Been in and out of ICU her whole life. Help Sierra below\n\nhttps://t.co/Xp01EU3qyD https://t.co/V5lkvrGLdQ
107     This is Rover. As part of pupper protocol he had to at least attempt to eat the plant. Confirmed not tasty. Needs peanut butter. 12/10 https://t.co/AiVljI6QCg        
135     This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender https://t.co/wk7TfysWHr                                    
199     Sometimes you guys remind me just how impactful a pupper can be. Cooper will be remembered as a good boy by so many. 14/10 rest easy friend https://t.co/oBL7LEJEzR   
220     Say hello to Boomer. He's a sandy pupper. Having a h*ckin blast. 12/10 would pet passionately https://t.co/ecb3LvExde                                                 
249     This is Pickles. She's a silly pupper. Thinks she's a dish. 12/10 would dry https://t.co/7mPCF4ZwEk                                                                   
293     Here's a pupper before and after being asked "who's a good girl?" Unsure as h*ck. 12/10 hint hint it's you https://t.co/ORiK6jlgdH                                    
297     This is Clark. He passed pupper training today. Round of appaws for Clark. 13/10 https://t.co/7pUjwe8X6B                                                              
304     This is Ava. She just blasted off. Streamline af. Aerodynamic as h*ck. One small step for pupper, one giant leap for pupkind. 12/10 https://t.co/W4KffrdX3Q           
330     This is Gidget. She's a spy pupper. Stealthy as h*ck. Must've slipped pup and got caught. 12/10 would forgive then pet https://t.co/zD97KYFaFa                        
352     I couldn't make it to the #WKCDogShow BUT I have people there on the ground relaying me the finest pupper pics possible. 13/10 for all https://t.co/jd6lYhfdH4        
363     This is Astrid. She's a guide doggo in training. 13/10 would follow anywhere https://t.co/xo7FZFIAao                                                                  
378     This is Kona. Yesterday she stopped by the department to see what it takes to be a police pupper. 12/10 vest was only a smidge too big https://t.co/j8D3PQJvpJ        
402     Retweet the h*ck out of this 13/10 pupper #BellLetsTalk https://t.co/wBmc7OaGvS                                                                                       
418     This is Gabe. He was the unequivocal embodiment of a dream meme, but also one h*ck of a pupper. You will be missed by so many. 14/10 RIP https://t.co/M3hZGadUuO      
444     Some happy pupper news to share. 10/10 for everyone involved \nhttps://t.co/MefMAZX2uv                                                                                
478     Here's a pupper with squeaky hiccups. Please enjoy. 13/10 https://t.co/MiMKtsLN6k                                                                                     
483     This is Cooper. Someone attacked him with a sharpie. Poor pupper. 11/10 nifty tongue slip tho https://t.co/01vpuRDXQ8                                                 
515     This is Craig. That's actually a normal sized fence he's stuck on. H*ckin massive pupper. 11/10 someone help him https://t.co/aAUXzoxaBy                              
527     Here's a pupper in a onesie. Quite pupset about it. Currently plotting revenge. 12/10 would rescue https://t.co/xQfrbNK3HD                                            
533     This is Ollie Vue. He was a 3 legged pupper on a mission to overcome everything. This is very hard to write. 14/10 we will miss you Ollie https://t.co/qTRY2qX9y4     
556     Pupper hath acquire enemy. 13/10 https://t.co/ns9qoElfsX                                                                                                              
575     This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj                      
580     Here's a very sleepy pupper. Appears to be portable as h*ck. 12/10 would snug intensely https://t.co/61sX7pW5Ca                                                       
608     Here's a helicopter pupper. He takes off at random. H*ckin hard to control. 12/10 rare af https://t.co/GRWPgNKt2z                                                     
                                                              ...                                                                                                             
1875    Meet Zuzu. He just graduated college. Astute pupper. Needs 2 leashes to contain him. Wasn't ready for the pic. 10/10 https://t.co/2H5SKmk0k7                          
1880    Say hello to Mollie. This pic was taken after she bet all her toys on Ronda Rousey. 10/10 hang in there pupper https://t.co/QMmAqA9VqO                                
1889    This is Superpup. His head isn't proportional to his body. Has yet to serve any justice. 11/10 maybe one day pupper https://t.co/gxIFgg8ktm                           
1897    Meet Rufio. He is unaware of the pink legless pupper wrapped around him. Might want to get that checked 10/10 &amp; 4/10 https://t.co/KNfLnYPmYh                      
1903    This pupper is fed up with being tickled. 12/10 I'm currently working on an elaborate heist to steal this dog https://t.co/F33n1hy3LL                                 
1907    This pupper just wants a belly rub. This pupper has nothing to do w the tree being sideways now. 10/10 good pupper https://t.co/AyJ7Ohk71f                            
1915    This is Lennon. He's in quite the predicament. 8/10 hang in there pupper https://t.co/7mf8XXPAZv                                                                      
1921    This is Gus. He's super stoked about being an elephant. Couldn't be happier. 9/10 for elephant pupper https://t.co/gJS1qU0jP7                                         
1930    This is Kaiya. She's an aspiring shoe model. 12/10 follow your dreams pupper https://t.co/nX8FiGRHvk                                                                  
1936    This is one esteemed pupper. Just graduated college. 10/10 what a champ https://t.co/nyReCVRiyd                                                                       
1937    This is Obie. He is on guard watching for evildoers from the comfort of his pumpkin. Very brave pupper. 11/10 https://t.co/cdwPTsGEAb                                 
1945    This is Raymond. He's absolutely terrified of floating tennis ball. 10/10 it'll be ok pupper https://t.co/QyH1CaY3SM                                                  
1948    This is Pickles. She's a tiny pointy pupper. Average walker. Very skeptical of wet leaf. 8/10 https://t.co/lepRCaGcgw                                                 
1954    This is Albert AKA King Banana Peel. He's a kind ruler of the kitchen. Very jubilant pupper. 10/10 overall great dog https://t.co/PN8hxgZ9We                          
1956    This is Jeffri. He's a speckled ice pupper. Very lazy. Enjoys the occasional swim. Rather majestic really. 7/10 https://t.co/0iyItbtkr8                               
1960    This little pupper can't wait for Christmas. He's pretending to be a present. S'cute. 11/10 twenty more days 🎁🎄🐶 https://t.co/m8r9rbcgX4                              
1967    This is Django. He's a skilled assassin pupper. 10/10 https://t.co/w0YTuiRd1a                                                                                         
1970    Meet Eve. She's a raging alcoholic 8/10 (would b 11/10 but pupper alcoholism is a tragic issue that I can't condone) https://t.co/U36HYQIijg                          
1974    This is Fletcher. He's had a ruff night. No more Fireball for Fletcher. 8/10 it'll be over soon pupper https://t.co/tA4WpkI2cw                                        
1977    This is Schnozz. He's had a blurred tail since birth. Hasn't let that stop him. 10/10 inspirational pupper https://t.co/a3zYMcvbXG                                    
1980    This is Chuckles. He is one skeptical pupper. 10/10 stay woke Chuckles https://t.co/ZlcF0TIRW1                                                                        
1981    This is Chet. He's having a hard time. Really struggling. 7/10 hang in there pupper https://t.co/eb4ta0xtnd                                                           
1985    This is Cheryl AKA Queen Pupper of the Skies. Experienced fighter pilot. Much skill. True hero. 11/10 https://t.co/i4XJEWwdsp                                         
1991    This lil pupper is sad because we haven't found Kony yet. RT to spread awareness. 12/10 would pet firmly https://t.co/Cv7dRdcMvQ                                      
1992    This is Norman. Doesn't bark much. Very docile pup. Up to date on current events. Overall nifty pupper. 6/10 https://t.co/ntxsR98f3U                                  
1995    Meet Scott. Just trying to catch his train to work. Doesn't need everybody staring. 9/10 ignore the haters pupper https://t.co/jyXbZ35MYz                             
2002    Say hello to Jazz. She should be on the cover of Vogue. 12/10 gorgeous pupper https://t.co/mVCMemhXAP                                                                 
2009    This is Rolf. He's having the time of his life. 11/10 good pupper https://t.co/OO6MqEbqG3                                                                             
2015    This is Opal. He's a Royal John Coctostan. Ready for transport. Basically indestructible. 9/10 good pupper https://t.co/yRBQF9OS7D                                    
2017    This is Bubba. He's a Titted Peebles Aorta. Evolutionary masterpiece. Comfortable with his body. 8/10 great pupper https://t.co/aNkkl5nH3W                            
Name: text, Length: 223, dtype: object
In these tweets the word "pupper" seems to be meaningful.

  1. Explore the rating numerators and denominators to determine whether the ratings can be corrected or should be excluded.
In [73]:
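# denominators
# value_counts() sorts by frequency, so the first entry is the most
# common denominator (10); everything after it is a non-10 denominator

denom_not_10 = archive_clean.rating_denominator.value_counts().index.tolist()[1:]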
In [74]:
mask = archive_clean.rating_denominator.isin(denom_not_10)

archive_clean[mask].text
Out[74]:
433     The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd                                                                      
516     Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
902     Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE                                                                                           
1068    After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ                             
1120    Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv                                                
1165    Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a                                                                                                         
1202    This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq                                                    
1228    Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1                                                                            
1254    Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12                                                             
1274    From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK                       
1351    Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa                                                                                       
1433    Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ                                                                             
1635    Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55                             
1662    This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5                              
1779    IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq                                                                                                   
1843    Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw                                                              
2335    This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv                                 
Name: text, dtype: object

There are two main types of mistakes.

  1. The first occurrence of / was picked up, but it belonged to a date or some other expression, not to the rating. Such cases are scarce and can be corrected manually.
  2. There are several dogs in the picture, and their number is used as a multiplier for the rating.
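
The first type of mistake can be illustrated with a regular expression that collects every number/number pair in a tweet instead of only the first one. This is a sketch, not the method used to build the archive: the helper names are hypothetical, and the rule "take the last pair whose denominator is a multiple of 10" is an assumption that happens to fit the tweets listed above.

```python
import re

# Any "number/number" pair; the numerator may carry a decimal part.
FRACTION = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def rating_candidates(text):
    """Return every (numerator, denominator) pair found in the text."""
    return [(float(n), int(d)) for n, d in FRACTION.findall(text)]


def pick_rating(text):
    """Heuristic (an assumption of this sketch): take the last pair whose
    denominator is a non-zero multiple of 10, or None if there is none."""
    plausible = [c for c in rating_candidates(text) if c[1] and c[1] % 10 == 0]
    return plausible[-1] if plausible else None


examples = [
    "She was the last surviving 9/11 search dog, and our second ever 14/10. RIP",
    "Happy 4/20 from the squad! 13/10 for all",
    "Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer.",
    "Here's a brigade of puppers. 80/80",
]
for text in examples:
    print(pick_rating(text), "<-", text)
```

A heuristic like this only narrows down the candidates. Tweets with two genuine ratings, or with no rating at all, still need a manual look.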

Of the tweets above:
516 - no rating, should be excluded
1068 - wrong numbers taken for rating, should be 14/10
1165 - wrong numbers taken for rating, should be 13/10
1202 - wrong numbers taken for rating, should be 11/10
1662 - wrong numbers taken for rating, should be 10/10
2335 - wrong numbers taken for rating, should be 9/10

In the other tweets the ratings are "adjusted" by the number of dogs in the picture. Since the ratings will be used as floats, these values can be left as is: dividing the numerator by the denominator later brings them to the same scale.
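
As a small illustration of why the multiplied ratings can stay, dividing numerator by denominator puts a group rating on the same scale as a single-dog rating. The column name rating and the row labelled 0 are hypothetical, added only for this sketch.

```python
import pandas as pd

# Two group ratings from the tweets above plus one hypothetical
# single-dog row (index 0) for comparison.
ratings = pd.DataFrame(
    {"rating_numerator": [84, 165, 12], "rating_denominator": [70, 150, 10]},
    index=[433, 902, 0],
)

# "rating" is a hypothetical column name for the normalised value.
ratings["rating"] = ratings.rating_numerator / ratings.rating_denominator
print(ratings)
```

Here 84/70 comes out at the same value as 12/10, so a picture of seven dogs is treated like a single 12/10 dog.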

In [75]:
archive_clean = archive_clean.drop(516)

# .loc avoids chained assignment, which may write to a temporary copy
archive_clean.loc[1068, 'rating_numerator'] = 14
archive_clean.loc[1068, 'rating_denominator'] = 10

archive_clean.loc[1165, 'rating_numerator'] = 13
archive_clean.loc[1165, 'rating_denominator'] = 10

archive_clean.loc[1202, 'rating_numerator'] = 11
archive_clean.loc[1202, 'rating_denominator'] = 10

archive_clean.loc[1662, 'rating_numerator'] = 10
archive_clean.loc[1662, 'rating_denominator'] = 10

archive_clean.loc[2335, 'rating_numerator'] = 9
archive_clean.loc[2335, 'rating_denominator'] = 10
In [76]:
archive_clean.rating_denominator.value_counts()
Out[76]:
10     2085
80     2   
50     2   
170    1   
150    1   
120    1   
110    1   
90     1   
70     1   
40     1   
Name: rating_denominator, dtype: int64
In [77]:
checked_denominators = archive_clean.rating_denominator.value_counts().index.tolist()[1:]
mask = ~archive_clean.rating_denominator.isin(checked_denominators)
In [78]:
# numerators

archive_clean[mask].rating_numerator.value_counts()
Out[78]:
12      486
10      437
11      414
13      288
9       153
8       98 
7       51 
14      39 
5       34 
6       32 
3       19 
4       15 
2       9  
1       4  
75      1  
420     1  
26      1  
27      1  
1776    1  
0       1  
Name: rating_numerator, dtype: int64
In [79]:
# rows with a denominator of 10 and a numerator above 14
mask_num = mask & (archive_clean.rating_numerator > 14)

archive_clean.loc[mask_num, ['tweet_id', 'text']]
Out[79]:
tweet_id text
695 786709082849828864 This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
763 778027034220126208 This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq
979 749981277374128128 This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
1712 680494726643068929 Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD
2074 670842764863651840 After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY

Several tweets have ratings in atypical forms because of special occasions, like Christmas. The last three tweets may be dropped. Also, not all numerators are integers, so it may be useful to check for halves as well.
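Judging by the numerators above (75 for the tweet containing 9.75/10), an integer-only pattern appears to lose the decimal part. Here is a minimal sketch, using Python's standard `re` module and shortened excerpts of tweets from this notebook, of a pattern that keeps it. The pattern and the helper name `first_rating` are assumptions for illustration, not the code that produced the archive.

```python
import re

# Numerator with an optional decimal part, a slash, then an integer denominator.
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

# Shortened tweet excerpts used as sample input.
texts = [
    "H*ckin magical af 9.75/10",
    "Super h*ckin rare. 11.27/10 would smile back",
    "she is also offering you her favorite monkey. 13.5/10",
]


def first_rating(text):
    """Return (numerator, denominator) of the first rating-like match, or None."""
    match = RATING.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


ratings = [first_rating(t) for t in texts]
print(ratings)
```

Like the original extraction, this still takes the first match in the text, so dates and similar fragments would need the same manual check.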

In [80]:
archive_clean = archive_clean.drop([979, 1712, 2074])
In [81]:
archive_clean.rating_numerator = archive_clean.rating_numerator.astype(float)

archive_clean.loc[695, 'rating_numerator'] = 9.75
archive_clean.loc[763, 'rating_numerator'] = 11.27
In [82]:
mask = archive_clean.rating_numerator == 5

archive_clean[mask].text
Out[82]:
45      This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948        
730     Who keeps sending in pictures without dogs in them? This needs to stop. 5/10 for the mediocre road https://t.co/ELqelxWMrC                      
956     Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8        
1399    This is Dave. He's a tropical pup. Short lil legs (dachshund mix?) Excels underwater, but refuses to eat kibble 5/10 https://t.co/ZJnCxlIf62    
1461    Please only send in dogs. This t-rex is very scary. 5/10 ...might still pet (vid by @helizabethmicha) https://t.co/Vn6w5w8TO2                   
1508    When bae says they can't go out but you see them with someone else that same night. 5/10 &amp; 10/10 for heartbroken pup https://t.co/aenk0KpoWM
1583    Army of water dogs here. None of them know where they're going. Have no real purpose. Aggressive barks. 5/10 for all https://t.co/A88x73TwMN    
1619    This is Jerry. He's a neat dog. No legs (tragic). Has more horns than a dog usually does. Bark is unique af. 5/10 https://t.co/85q7xlplsJ       
1624    Here we have a basking dino pupper. Looks powerful. Occasionally shits eggs. Doesn't want the holidays to end. 5/10 https://t.co/DnNweb5eTO     
1645    This is Jiminy. He's not the brightest dog. Needs to lay off the kibble. 5/10 still petable https://t.co/omln4LOy1x                             
1680    Unique dog here. Wrinkly as hell. Weird segmented neck. Finger on fire. Doesn't seem to notice. 5/10 might still pet https://t.co/Hy9La4xNX3    
1727    Meet Penelope. She's a bacon frise. Total babe (lol get it like the movie). Doesn't bark tho. 5/10 very average dog https://t.co/SDcQYg0HSZ     
1796    This is Juckson. He's totally on his way to a nascar race. 5/10 for Juckson https://t.co/IoLRvF0Kak                                             
1808    Exotic handheld dog here. Appears unathletic. Feet look deadly. Can be thrown a great distance. 5/10 might pet idk https://t.co/Avq4awulqk      
1820    This is Bubbles. He kinda resembles a fish. Always makes eye contact with u no matter what. Sneaky tongue slip. 5/10 https://t.co/Nrhvc5tLFT    
1861    Rare shielded battle dog here. Very happy about abundance of lettuce. Painfully slow fetcher. Still petable. 5/10 https://t.co/C3tlKVq7eO       
1874    This is Steven. He got locked outside. Damn it Steven. 5/10 nice grill tho https://t.co/zf7Sxxjfp3                                              
1901    Two gorgeous dogs here. Little waddling dog is a rebel. Refuses to look at camera. Must be a preteen. 5/10 &amp; 8/10 https://t.co/YPfw7oahbD   
1904    Rare submerged pup here. Holds breath for a long time. Frowning because that spoon ignores him. 5/10 would still pet https://t.co/EJzzNHE8bE    
1925    This is Earl. Earl is lost. Someone help Earl. He has no tags. Just trying to get home. 5/10 hang in there Earl https://t.co/1ZbfqAVDg6         
1979    Extraordinary dog here. Looks large. Just a head. No body. Rather intrusive. 5/10 would still pet https://t.co/ufHWUFA9Pu                       
2013    Exotic underwater dog here. Very shy. Wont return tennis balls I toss him. Never been petted. 5/10 I bet he's soft https://t.co/WH7Nzc5IBA      
2026    This is Brad. He's a chubby lil pup. Doesn't really need the food he's trying to reach. 5/10 you've had enough Brad https://t.co/vPXKSaNsbE     
2063    This is Anthony. He just finished up his masters at Harvard. Unprofessional tattoos. Always looks perturbed. 5/10 https://t.co/iHLo9rGay1       
2092    This dude slaps your girl's ass what do you do?\n5/10 https://t.co/6dioUL6gcP                                                                   
2109    Vibrant dog here. Fabulous tail. Only 2 legs tho. Has wings but can barely fly (lame). Rather elusive. 5/10 okay pup https://t.co/cixC0M3P1e    
2134    This is Randall. He's from Chernobyl. Built playground himself. Has been stuck up there quite a while. 5/10 good dog https://t.co/pzrvc7wKGd    
2139    Awesome dog here. Not sure where it is tho. Spectacular camouflage. Enjoys leaves. Not very soft. 5/10 still petable https://t.co/rOTOteKx4q    
2153    This is a brave dog. Excellent free climber. Trying to get closer to God. Not very loyal though. Doesn't bark. 5/10 https://t.co/ODnILTr4QM     
2181    Two gorgeous pups here. Both have cute fake horns(adorable). Barn in the back looks on fire. 5/10 would pet rly well https://t.co/w5oYFXi0uh    
2206    Meet Zeek. He is a grey Cumulonimbus. Zeek is hungry. Someone should feed Zeek asap. 5/10 absolutely terrifying https://t.co/fvVNScw8VH         
2242    Wow. Armored dog here. Ready for battle. Face looks dangerous. Not very loyal. Lil dog on back havin a blast. 5/10 https://t.co/SyMoWrp368      
2312    This is Josep. He is a Rye Manganese mix. Can drive w eyes closed. Very irresponsible. Menace on the roadways. 5/10 https://t.co/XNGeDwrtYH     
2351    Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq                        
Name: text, dtype: object
In [83]:
archive_clean.loc[45, 'rating_numerator'] = 13.5
In [84]:
archive_clean.rating_numerator.value_counts()
Out[84]:
12.00     486
10.00     437
11.00     414
13.00     288
9.00      153
8.00      98 
7.00      51 
14.00     39 
5.00      33 
6.00      32 
3.00      19 
4.00      15 
2.00      9  
1.00      4  
60.00     1  
11.27     1  
45.00     1  
204.00    1  
13.50     1  
9.75      1  
121.00    1  
84.00     1  
0.00      1  
80.00     1  
88.00     1  
144.00    1  
44.00     1  
165.00    1  
99.00     1  
Name: rating_numerator, dtype: int64

  1. Combine the cleaned 'rating_numerator' and 'rating_denominator' columns into one 'rating' column in float format.
In [85]:
archive_clean['rating'] = archive_clean.rating_numerator / archive_clean.rating_denominator
In [86]:
archive_clean.rating.describe()
Out[86]:
count    2093.000000
mean     1.061468   
std      0.214564   
min      0.000000   
25%      1.000000   
50%      1.100000   
75%      1.200000   
max      1.400000   
Name: rating, dtype: float64
In [87]:
archive_clean = archive_clean[['tweet_id', 'timestamp', 'source', 
                               'text', 'expanded_urls', 'name', 
                               'floofer', 'dog_stages', 'rating']]

archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2093 entries, 0 to 2355
Data columns (total 9 columns):
tweet_id         2093 non-null int64
timestamp        2093 non-null datetime64[ns]
source           2093 non-null category
text             2093 non-null object
expanded_urls    2090 non-null object
name             1410 non-null object
floofer          10 non-null object
dog_stages       324 non-null object
rating           2093 non-null float64
dtypes: category(1), datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 229.4+ KB

  1. Join the Twitter archive data with the image predictions data and the additional information from Twitter on tweet IDs. Keep only rows with data in all three dataframes.
In [88]:
twitter_archive_master = archive_clean.merge(image_predictions, on = 'tweet_id', suffixes = ('', '_imp'))
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1967 entries, 0 to 1966
Data columns (total 20 columns):
tweet_id         1967 non-null int64
timestamp        1967 non-null datetime64[ns]
source           1967 non-null category
text             1967 non-null object
expanded_urls    1967 non-null object
name             1369 non-null object
floofer          8 non-null object
dog_stages       293 non-null object
rating           1967 non-null float64
jpg_url          1967 non-null object
img_num          1967 non-null int64
p1               1967 non-null object
p1_conf          1967 non-null float64
p1_dog           1967 non-null bool
p2               1967 non-null object
p2_conf          1967 non-null float64
p2_dog           1967 non-null bool
p3               1967 non-null object
p3_conf          1967 non-null float64
p3_dog           1967 non-null bool
dtypes: bool(3), category(1), datetime64[ns](1), float64(4), int64(2), object(9)
memory usage: 269.1+ KB
In [89]:
twitter_archive_master = twitter_archive_master.merge(tweet_jsons, on = 'tweet_id', suffixes = ('', '_jsons'))
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1965 entries, 0 to 1964
Data columns (total 22 columns):
tweet_id          1965 non-null int64
timestamp         1965 non-null datetime64[ns]
source            1965 non-null category
text              1965 non-null object
expanded_urls     1965 non-null object
name              1367 non-null object
floofer           8 non-null object
dog_stages        293 non-null object
rating            1965 non-null float64
jpg_url           1965 non-null object
img_num           1965 non-null int64
p1                1965 non-null object
p1_conf           1965 non-null float64
p1_dog            1965 non-null bool
p2                1965 non-null object
p2_conf           1965 non-null float64
p2_dog            1965 non-null bool
p3                1965 non-null object
p3_conf           1965 non-null float64
p3_dog            1965 non-null bool
favorite_count    1965 non-null int64
retweet_count     1965 non-null int64
dtypes: bool(3), category(1), datetime64[ns](1), float64(4), int64(4), object(9)
memory usage: 299.5+ KB
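The row counts shrink after each merge because `DataFrame.merge` performs an inner join by default, keeping only keys present on both sides. Here is a minimal sketch with three toy dataframes; the frame names and all values are invented for illustration.

```python
import pandas as pd

# Three toy frames sharing a 'tweet_id' key; values are invented for illustration.
archive = pd.DataFrame({"tweet_id": [1, 2, 3, 4], "rating": [1.2, 1.1, 1.0, 1.3]})
predictions = pd.DataFrame({"tweet_id": [1, 2, 3], "p1": ["collie", "pug", "chow"]})
counts = pd.DataFrame({"tweet_id": [2, 3, 4], "favorite_count": [10, 20, 30]})

# how='inner' is the default for DataFrame.merge; it is spelled out for clarity.
master = (
    archive
    .merge(predictions, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)

print(master)
```

Only the tweet IDs found in all three frames survive, which is the behaviour asked for in the cleaning step.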
In [90]:
# writing cleaned data to csv file
twitter_archive_master.to_csv('twitter_archive_master.csv', index = False)

Data Analysis

A separate text-only report on the data analysis, in HTML format, was recreated with R Markdown. It includes slightly fewer findings than the code here, because it was becoming too long. See it here.

In [91]:
# setting up graphics
import matplotlib.pyplot as plt 

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
In [92]:
# loading cleaned data
df = pd.read_csv('twitter_archive_master.csv')
In [93]:
# fixing types
df['timestamp'] = pd.to_datetime(df.timestamp)
df['dog_stages'] = df.dog_stages.astype('category')
df['source'] = df.source.astype('category')
df = df.set_index('timestamp')
In [94]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1965 entries, 2017-08-01 16:23:56 to 2015-11-15 22:32:08
Data columns (total 21 columns):
tweet_id          1965 non-null int64
source            1965 non-null category
text              1965 non-null object
expanded_urls     1965 non-null object
name              1367 non-null object
floofer           8 non-null object
dog_stages        293 non-null category
rating            1965 non-null float64
jpg_url           1965 non-null object
img_num           1965 non-null int64
p1                1965 non-null object
p1_conf           1965 non-null float64
p1_dog            1965 non-null bool
p2                1965 non-null object
p2_conf           1965 non-null float64
p2_dog            1965 non-null bool
p3                1965 non-null object
p3_conf           1965 non-null float64
p3_dog            1965 non-null bool
favorite_count    1965 non-null int64
retweet_count     1965 non-null int64
dtypes: bool(3), category(2), float64(4), int64(4), object(8)
memory usage: 270.9+ KB

Ok, Python. Who is the most favorited dog of all time at @dog_rates? At least in this dataset.

In [95]:
top_dog = df.loc[df.favorite_count.idxmax(), : ]

print("Tweet:", top_dog.text + "\n", 
      "Favorite count: ", str(top_dog.favorite_count) + "\n", 
      "Retweet_count:", top_dog.retweet_count)
Tweet: Here's a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad) https://t.co/7wE9LTEXC4
 Favorite count:  163530
 Retweet_count: 83146
In [96]:
from IPython.display import Image
from IPython.core.display import HTML 

Image(url = top_dog.jpg_url)
Out[96]:

It is actually a video, and maybe you should take a look, too, though I guess I'm not the first to suggest that. By the way, the lowest rating went to a screenshot from another Twitter account, for plagiarism. Do you agree?

In [97]:
df.loc[df.rating.idxmin(), ].text
Out[97]:
"When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag"

Ok, let's be a bit more serious. The cleaned data set consists of 1965 rows and 22 variables, including data from the WeRateDogs Twitter archive, additional Twitter data gathered through the API, and dog breed predictions made by a neural network.

In [98]:
df.describe()
Out[98]:
tweet_id rating img_num p1_conf p2_conf p3_conf favorite_count retweet_count
count 1.965000e+03 1965.000000 1965.000000 1965.000000 1.965000e+03 1.965000e+03 1965.000000 1965.000000
mean 7.360774e+17 1.054734 1.202545 0.594573 1.346915e-01 6.021736e-02 8740.952672 2650.865649
std 6.756862e+16 0.216694 0.559762 0.272061 1.010774e-01 5.099516e-02 12810.601111 4724.318781
min 6.660209e+17 0.000000 1.000000 0.044333 1.011300e-08 1.740170e-10 78.000000 11.000000
25% 6.758531e+17 1.000000 1.000000 0.362925 5.351500e-02 1.605590e-02 1887.000000 591.000000
50% 7.088343e+17 1.100000 1.000000 0.587764 1.175080e-01 4.934910e-02 3939.000000 1273.000000
75% 7.881506e+17 1.200000 1.000000 0.847292 1.955730e-01 9.160200e-02 10892.000000 3028.000000
max 8.924206e+17 1.400000 4.000000 1.000000 4.880140e-01 2.734190e-01 163530.000000 83146.000000

As can be seen from the summary statistics on favorites, with a mean favorite count of about 8741, our top dog is a real outlier. The same is true for the retweets, where the mean is about 2651. The distributions seem to be noticeably right-skewed; we can check that with histograms.
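One quick numeric hint of right skew, before plotting anything, is a mean that sits well above the median. The sketch below simply reuses the mean and 50% values from the summary table above; the dictionary layout is an assumption made for illustration.

```python
# Mean and median (50%) values copied from the df.describe() output above.
summary = {
    "favorite_count": {"mean": 8740.952672, "median": 3939.0},
    "retweet_count": {"mean": 2650.865649, "median": 1273.0},
}

# A mean well above the median is a common sign of a right-skewed distribution.
ratios = {name: s["mean"] / s["median"] for name, s in summary.items()}

for name, ratio in ratios.items():
    print(f"{name}: mean is {ratio:.2f} times the median")
```

In both columns the mean is more than twice the median, which is consistent with a long right tail.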

In [99]:
df.favorite_count.hist(bins = 100);
In [100]:
df.retweet_count.hist(bins = 100);
In [101]:
timestamp = df.index

plt.hist(timestamp, bins = 100);

As can be seen from the plot above, WeRateDogs put a lot of effort into promoting the account, posting quite frequently during the first months. We can see if it paid off by looking at the mean retweet and favorite counts per month.

In [102]:
plot = df.groupby([df.index.year, df.index.month]).retweet_count.mean().plot()
plot.set(xlabel = 'Time', ylabel = 'Count', title = 'Mean Retweet Count Per Month');
In [103]:
plot = df.groupby([df.index.year, df.index.month]).favorite_count.mean().plot()
plot.set(xlabel = 'Time', ylabel = 'Count', title = 'Mean Favorite Count Per Month');
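Grouping by year and month yields a two-level index. As an alternative, here is a minimal sketch that groups by monthly periods, so that each group is labelled with a single value such as 2016-01. The toy frame and its counts are invented for illustration.

```python
import pandas as pd

# Toy data indexed by timestamp; the counts are invented for illustration.
toy = pd.DataFrame(
    {"retweet_count": [100, 300, 500, 700]},
    index=pd.to_datetime(["2016-01-05", "2016-01-20", "2016-02-01", "2016-02-15"]),
)

# One group per calendar month; the result is indexed by monthly periods.
monthly_mean = toy.groupby(toy.index.to_period("M"))["retweet_count"].mean()
print(monthly_mean)
```

The resulting series can be plotted in the same way as the grouped means above.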
In [104]:
df.rating.mean(), df.rating.median()
Out[104]:
(1.0547338422391856, 1.1)

The median rating is 11/10 and the interquartile range is between 10/10 and 12/10.

In [105]:
df.rating.hist(bins = 20);

But it seems that a dog doesn't need the highest possible rating to be the most popular - the highest favorite and retweet counts are in the 13/10 group (see the plots below). Maybe 14/10 is too subjective?

In [106]:
plot = df.plot.scatter(x = 'rating', y = 'favorite_count')
plot.set(xlabel = 'Rating', ylabel = 'Favorites', title = 'Favorites vs Rating');
In [107]:
plot = df.plot.scatter(x = 'rating', y = 'retweet_count')
plot.set(xlabel = 'Rating', ylabel = 'Retweets', title = 'Retweets vs Rating');
In [108]:
plot = df.plot.scatter(x = 'retweet_count', y = 'favorite_count')
plot.set(xlabel = 'Retweets', ylabel = 'Favorites', title = 'Favoriting & Retweeting');

The more retweets, the more likes. Did you expect that? Or should it be the other way around?
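The strength of such a relationship can be summarised with a correlation coefficient. Here is a minimal sketch using `Series.corr`, which computes the Pearson correlation by default. The toy values are invented and say nothing about the actual coefficient in this dataset.

```python
import pandas as pd

# Toy counts, invented for illustration only.
toy = pd.DataFrame(
    {
        "retweet_count": [10, 20, 30, 40, 50],
        "favorite_count": [35, 50, 110, 120, 160],
    }
)

# Pearson correlation between the two columns (the default method of Series.corr).
r = toy["retweet_count"].corr(toy["favorite_count"])
print(round(r, 3))
```

A value close to 1 means the two counts rise together almost linearly. With distributions as skewed as the ones here, a rank-based method such as `method='spearman'` may be a useful cross-check.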

In [109]:
plot = df.boxplot(column = 'rating', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Rating', title = 'Rating By Dog Stages');

#  This cell doesn't produce any warnings on my local machine. See the act_report.html.
# It seems like Project Workspace needs some upgrade )
/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  return getattr(obj, method)(*args, **kwds)

If you like puppies, I may have some bad news for you: their cuteness seems to win them, on average, a lower rating than the other stages get. The following two boxplots, on favorites and retweets, show the same tendency.

In [110]:
plot = df.boxplot(column = 'favorite_count', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Favorites', title = 'Favorites By Dog Stages');

#  This cell doesn't produce any warnings on my local machine. See the act_report.html.
# It seems like Project Workspace needs some upgrade )
/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  return getattr(obj, method)(*args, **kwds)
In [111]:
plot = df.boxplot(column = 'retweet_count', by = 'dog_stages')
plot.set(xlabel = 'Dog Stages', ylabel = 'Retweets', title = 'Retweets By Dog Stages');

#  This cell doesn't produce any warnings on my local machine. See the act_report.html.
# It seems like Project Workspace needs some upgrade )
/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  return getattr(obj, method)(*args, **kwds)
In [112]:
df.dog_stages.value_counts()
Out[112]:
pupper          203
doggo           62 
puppo           24 
pupper,doggo    4  
Name: dog_stages, dtype: int64

It's a pity that there is not enough data to judge whether a pair of a dog with a pup really does better on average than the others. But we can use our subjective expert opinion here. Aren't they great?

In [114]:
from skimage import io

# image URLs of the tweets tagged with both stages
parents = list(df[df.dog_stages == 'pupper,doggo'].jpg_url)

# read each image from its URL
imgs = [io.imread(url) for url in parents]

plt.figure(figsize=(20, 5))
columns = 4
rows = len(imgs) // columns + 1  # integer division: subplot expects an int
for i, img in enumerate(imgs):
    plt.subplot(rows, columns, i + 1)
    plt.imshow(img)

Reference

  1. Heavily exploited StackOverflow post on multiple images in Jupyter notebooks.