by Tatiana Kurilo


Data Gathering

Loading Twitter Archive Data Locally

I downloaded twitter_archive_enhanced.csv, uploaded it to the Project Workspace on Udacity, and read it into the dataframe twitter_archive with pandas.
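The loading step is a single pandas call; a minimal sketch, using a tiny stand-in file (the rows and the column subset below are illustrative, not real archive data):

```python
import io
import pandas as pd

# Stand-in for twitter_archive_enhanced.csv; in the project the real file
# is read the same way: pd.read_csv('twitter_archive_enhanced.csv')
sample_csv = io.StringIO(
    "tweet_id,timestamp,rating_numerator,rating_denominator,name\n"
    "892420643555336193,2017-08-01 16:23:56 +0000,13,10,Phineas\n"
    "892177421306343426,2017-08-01 00:17:27 +0000,13,10,Tilly\n"
)
twitter_archive = pd.read_csv(sample_csv)
```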

Downloading Image Prediction Data Programmatically

I wrote code to download image_predictions.tsv directly to the Project Workspace and read it into the dataframe image_predictions with pandas.
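The download step can be sketched as follows; the URL is not reproduced here (the project instructions provide it), and note that the file is tab-separated, so `sep='\t'` is needed when reading it:

```python
import pandas as pd
import requests

def download_file(url, path):
    """Fetch a remote file and write its bytes to disk."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(path, 'wb') as f:
        f.write(response.content)

# In the notebook (URL comes from the project instructions):
# download_file('<project-provided URL>', 'image_predictions.tsv')
# image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
```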

Gathering Additional Information Via Twitter API

I wrote a script that retrieves the Twitter JSON data via the API with the Tweepy library, using the list of tweet IDs from the twitter_archive dataframe, and saves it to the file tweet_json.txt. I uploaded this file to the Project Workspace and added the script’s code to the project notebook without the authentication keys. Since running it without the keys would cause errors, I commented out the cell that contains the code. I then read the data from tweet_json.txt into the dataframe tweet_jsons, using the json and pandas libraries.
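A minimal sketch of the read-back step, parsing one JSON object per line as in tweet_json.txt (the two sample lines and their counts are illustrative, not real data; the commented Tweepy calls show roughly how the gathering script produced the file):

```python
import json
import pandas as pd

# The gathering script (run with real keys) looked roughly like:
# api = tweepy.API(auth, wait_on_rate_limit=True)
# status = api.get_status(tweet_id, tweet_mode='extended')
# file.write(json.dumps(status._json) + '\n')

# Stand-in for two lines of tweet_json.txt:
sample_lines = [
    '{"id": 892420643555336193, "retweet_count": 8853, "favorite_count": 39467}',
    '{"id": 892177421306343426, "retweet_count": 6514, "favorite_count": 33819}',
]

records = []
for line in sample_lines:
    tweet = json.loads(line)
    records.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

tweet_jsons = pd.DataFrame(records)
```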


Data Assessing

Twitter Archive Data

There are 17 variables in the twitter_archive dataframe; the first 10 come from the original Twitter data, and the remaining 7 were added later, based mostly on the content of the tweets.

  1. tweet_id - Twitter identifier for the Tweet.
  2. in_reply_to_status_id - if the represented Tweet is a reply, this will contain the original Tweet’s ID.
  3. in_reply_to_user_id - if the represented Tweet is a reply, this will contain the original Tweet’s author ID.
  4. timestamp - UTC time when the Tweet was created.
  5. source - utility (e.g., iPhone, Android, Web client) used to post the Tweet.
  6. text - the actual text of the Tweet.
  7. retweeted_status_id - the unique identifier for the original Tweet if this is a retweet, otherwise it is null.
  8. retweeted_status_user_id - unique identifier of the original Tweet’s author.
  9. retweeted_status_timestamp - UTC time when the original Tweet was created.
  10. expanded_urls - a list of expanded URLs from Twitter entities or expanded entities.

  11. rating_numerator - the numerator of the dog rating from the Tweet (M in M/N, integer).
  12. rating_denominator - the denominator of the dog rating from the Tweet (N in M/N, expected to equal 10).
  13. name - name of the dog, extracted from the Tweet text (“None”, if missing).
  14. doggo - one of the dog stages (oldest), extracted from the Tweet’s text (“None”, if missing).
  15. floofer - furry dog description, extracted from the Tweet’s text (“None”, if missing).
  16. pupper - one of the dog stages (youngest), extracted from the Tweet’s text (“None”, if missing).
  17. puppo - one of the dog stages (middle), extracted from the Tweet’s text (“None”, if missing).

The descriptions of Twitter data used in the list come from this source.

Quality Issues In Twitter Archive Data

  1. 78 tweets are replies and can’t be counted as tweets of the “standard format” - an image, a text presenting the dog in the image and a rating - as they often lack some of this information.
  2. 181 tweets have non-null values in retweeted_status_id and retweeted_status_user_id, which means that these tweets are actually retweets, which doesn’t follow the project guidelines.
  3. Missing values in the dog stage, floofer and name columns are encoded with “None” strings instead of pandas NaN values.
  4. Values in the timestamp and retweeted_status_timestamp columns are not in datetime format.
  5. Values in the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns are stored as floats and displayed in scientific notation. However, once retweets and replies are removed, these columns (together with retweeted_status_timestamp) will contain only null values and can simply be dropped.
  6. Some rating numerators are too large for the “M/10” pattern - it is fine when M exceeds 10 by a few points, but not by several times. Some are unexpectedly low.
  7. Some rating denominators are not equal to 10.
  8. The name column contains articles and other “non-name” words.
  9. Values in the source column are links with HTML wrapped around the actual content, which hurts readability. Also, the column should be of the category type.
  10. Duplicated URLs in the same cells of expanded_urls column.
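The HTML wrapping noted in issue 9 can be stripped with a regular expression; a minimal sketch, using a typical value from the source column:

```python
import pandas as pd

# A typical raw value of the source column:
raw = ('<a href="http://twitter.com/download/iphone" rel="nofollow">'
       'Twitter for iPhone</a>')

source = pd.Series([raw])
# Keep only the human-readable anchor text, then switch to the category dtype.
cleaned = source.str.extract(r'>([^<]+)<', expand=False).astype('category')
```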

Tidiness Issues In Twitter Archive Data

  1. The dog stage columns - pupper, puppo and doggo - may be combined into one column as levels of a single categorical variable. Still, dual values may occur when several dogs appear in one picture.
  2. The rating columns - rating_numerator and rating_denominator - should be combined into a single rating value in float format for use in the analysis.
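Combining the stage columns (tidiness issue 1) can be sketched as below; the toy rows are illustrative, and joining non-“None” labels with a comma is one simple way to keep the dual-stage cases visible:

```python
import pandas as pd

# Toy frame mirroring the stage columns ("None" marks absence).
df = pd.DataFrame({
    'doggo':  ['None', 'doggo', 'None'],
    'pupper': ['pupper', 'None', 'None'],
    'puppo':  ['None', 'None', 'None'],
})

stage_cols = ['doggo', 'pupper', 'puppo']
# Join the non-"None" labels per row; a comma flags dual-stage dogs,
# and an empty result becomes a missing value.
df['dog_stages'] = (
    df[stage_cols]
    .apply(lambda row: ','.join(v for v in row if v != 'None'), axis=1)
    .replace('', pd.NA)
)
```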

Image Prediction Data

The image_predictions dataframe contains the top three predictions of the dog breed made by a neural network, based on the images in the tweets. There are 12 variables in the dataframe.

  1. tweet_id - Twitter identifier for the Tweet.
  2. jpg_url - the URL of the image used for dog breed prediction.
  3. img_num - the number of the image used for dog breed prediction (1 to 4 since tweets can have up to four images).
  4. p1 - the algorithm’s #1 prediction for the image in the Tweet.
  5. p1_conf - confidence estimation for the #1 prediction.
  6. p1_dog - whether or not the #1 prediction is a breed of dog.
  7. p2 - the algorithm’s #2 prediction for the image in the Tweet.
  8. p2_conf - confidence estimation for the #2 prediction.
  9. p2_dog - whether or not the #2 prediction is a breed of dog.
  10. p3 - the algorithm’s #3 prediction for the image in the Tweet.
  11. p3_conf - confidence estimation for the #3 prediction.
  12. p3_dog - whether or not the #3 prediction is a breed of dog.

Quality Issues

  • Image predictions are missing for 281 tweets from the Twitter archive data.
  • Underscores in predictions may be changed to spaces for readability.
  • Predictions may be changed to category type.

In some cases it may also be reasonable to combine the predictions into three columns:
Number Of Prediction | Dog Breed | Confidence
but for the purposes of this project, where such a change would produce many rows for the same tweets, it seems unreasonable.

Additional Twitter Data

Though the full tweet data is accessible in the text file, for the purposes of the project only the following variables of the tweet_jsons dataframe will be used:

  1. tweet_id - Twitter identifier for the Tweet.
  2. retweet_count - the number of times the Tweet has been retweeted.
  3. favorite_count - the approximate number of times the Tweet has been liked by Twitter users.

Quality Issues

  • Missing information for 16 tweets from the Twitter archive data: Twitter returned a “No status found with that ID” message.

Data Cleaning

Define

The following steps need to be taken to clean and combine the data for further analysis.

  1. Identify and exclude the rows in twitter_archive dataframe that correspond to retweets and replies.
  2. Exclude in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns.
  3. Convert values in timestamp column to datetime format.
  4. Clean HTML information in source column and convert it to category type.
  5. Remove duplicated URLs from expanded_urls column.
  6. Replace “None” values in the dog stage and name columns with pandas NaN values.
  7. Check whether any names can be extracted from tweets that have non-name words in the name column, and add the proper names, if any.
  8. Replace other “non-name” values in name column with NaN values.
  9. Combine the pupper, puppo and doggo columns into one dog_stages column.
  10. Check dog_stages for correctness.
  11. Explore the rating numerators and denominators to define if the ratings can be corrected or should be excluded.
  12. Combine the cleaned rating_numerator and rating_denominator columns in one rating column in float format.

  13. Join the twitter_archive dataframe with the image_predictions and tweet_jsons dataframes on the tweet_id/id columns, removing the rows whose tweet IDs are not present in all three dataframes.
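Steps 12 and 13 can be sketched with toy dataframes (all values below are illustrative); inner joins keep only the tweet IDs present in all three dataframes:

```python
import pandas as pd

# Toy stand-ins for the three cleaned dataframes.
twitter_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'rating_numerator': [13, 12, 14],
    'rating_denominator': [10, 10, 10],
})
image_predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['pug', 'corgi']})
tweet_jsons = pd.DataFrame({'tweet_id': [1, 2], 'retweet_count': [100, 200]})

# Step 12: one float rating column.
twitter_archive['rating'] = (
    twitter_archive['rating_numerator'] / twitter_archive['rating_denominator']
)

# Step 13: chained inner joins on tweet_id.
master = (
    twitter_archive
    .merge(image_predictions, on='tweet_id', how='inner')
    .merge(tweet_jsons, on='tweet_id', how='inner')
)
```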

Code & Test

The code and the test results are available in a separate Jupyter Notebook. Some steps required re-assessing the data after cleaning (e.g. the dog stages).


Data Analysis is available in a separate report file.