I downloaded twitter_archive_enhanced.csv, uploaded it to the Project Workspace on Udacity, and read it into the twitter_archive dataframe with pandas.
I wrote code to download image_predictions.tsv directly into the Project Workspace and read it into the image_predictions dataframe with pandas.
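The download step can be sketched with the requests library and pandas; the URL below is a placeholder for the address given in the project instructions:

```python
import io

import pandas as pd
import requests

# Placeholder URL -- the real address is provided in the project details.
PREDICTIONS_URL = "https://example.com/image-predictions.tsv"


def parse_predictions(raw: bytes) -> pd.DataFrame:
    """Parse the raw tab-separated bytes into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw), sep="\t")


def fetch_image_predictions(url: str = PREDICTIONS_URL) -> pd.DataFrame:
    """Download the TSV file, keep a raw copy on disk, and parse it."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    # The project rubric asks for the raw file itself, so save it first.
    with open("image_predictions.tsv", "wb") as f:
        f.write(response.content)
    return parse_predictions(response.content)
```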
I wrote a script that gathers the tweets' JSON data via the Twitter API with the Tweepy library, using the list of tweet IDs from the twitter_archive dataframe, and saves it to the tweet_json.txt file. I uploaded this file to the Project Workspace and added the script's code to the project notebook without the authentication keys. Since the code would raise errors if left that way, I commented out the cell containing it. I then read the data from tweet_json.txt into the tweet_jsons dataframe using the json and pandas libraries.
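Reading the saved file back can be sketched as below. This assumes one JSON object per line; the field names follow the standard Twitter API payload, and which fields to keep is a choice for the project:

```python
import json

import pandas as pd


def load_tweet_json(path: str = "tweet_json.txt") -> pd.DataFrame:
    """Read a file with one JSON tweet per line, keeping selected fields."""
    rows = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            rows.append({
                "id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows)
```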
There are 17 variables in the twitter_archive dataframe: the first 10 come from the original Twitter data, and the remaining 7 were added later, based mostly on the content of the tweets.
expanded_urls - a list of expanded URLs from Twitter entities or expanded entities.
puppo - one of the dog stages (middle), extracted from the Tweet’s text (“None”, if missing).
The descriptions of Twitter data used in the list come from this source.
The assessment revealed the following issues in the twitter_archive dataframe:

- Some rows have non-null retweeted_status_id and retweeted_status_user_id values, which means that these tweets are actually retweets; this doesn't follow the project guidelines.
- Missing values in the floofer and name columns are encoded as "None" strings, not pandas NaN values.
- The timestamp and retweeted_status_timestamp columns are not in datetime format.
- The in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns are in float format and scientific notation. However, once retweets and replies are removed, these columns (together with retweeted_status_timestamp) will contain only null values and can be dropped, so no additional action is needed for them.
- The name column contains articles and other "non-name" words.
- The source column values are links with HTML wrapped around the actual content, which doesn't improve readability. Also, the type of the column should be category.
- expanded_urls column.
- The pupper, puppo and doggo columns may be combined into one as levels of a single categorical variable. Still, dual values may occur when there are several dogs in a picture.
- The rating_numerator and rating_denominator columns should be used to calculate one rating value in float format to be used in the analysis.

The image_predictions dataframe contains the top three predictions of the dog breed made by a neural network, based on the images in the tweets. There are 12 variables in the dataframe.
In some cases it may also be reasonable to combine the predictions into three columns:

Number Of Prediction | Dog Breed | Confidence

but for the purpose of this project, where such a change would produce many rows for the same tweet, it seems unreasonable.
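For illustration only, that wide-to-long reshape could be sketched as follows; the p1/p1_conf column naming follows the image_predictions dataframe, and the toy data is made up:

```python
import pandas as pd

# Toy frame mimicking the three prediction/confidence column pairs.
preds = pd.DataFrame({
    "tweet_id": [111],
    "p1": ["pug"], "p1_conf": [0.90],
    "p2": ["chihuahua"], "p2_conf": [0.05],
    "p3": ["beagle"], "p3_conf": [0.03],
})


def predictions_to_long(preds: pd.DataFrame) -> pd.DataFrame:
    """Stack the three prediction pairs into one row per (tweet, rank)."""
    parts = []
    for i in (1, 2, 3):
        part = preds[["tweet_id", f"p{i}", f"p{i}_conf"]].copy()
        part.columns = ["tweet_id", "dog_breed", "confidence"]
        part["prediction_number"] = i
        parts.append(part)
    return pd.concat(parts, ignore_index=True)
```

As the text notes, each tweet then appears on three rows, which is why the project keeps the wide layout instead.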
Though the full tweet data is accessible in the text file, for the purpose of the project only the following variables from the tweet_jsons dataframe will be used:
The following steps need to be taken to clean and combine the data for further analysis.
- Remove the rows in the twitter_archive dataframe that correspond to retweets and replies.
- Drop the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id columns.
- Convert the timestamp column to datetime format.
- Extract the actual content from the source column and convert it to category type.
- expanded_urls column.
- Replace the "None" values in the dog stage columns and name with pandas NaN values.
- Check the name column and add the proper names, if any.
- Replace the remaining "non-name" words in the name column with NaN values.
- Combine the pupper, puppo and doggo columns in one dog_stages column.
- Check dog_stages for correctness.
- Combine the cleaned rating_numerator and rating_denominator columns in one rating column in float format.
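A minimal pandas sketch of the archive-cleaning steps above, assuming the column names from twitter_archive; the lowercase-word rule for detecting "non-name" values is an assumption, not the project's definitive method:

```python
import numpy as np
import pandas as pd


def clean_archive(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a twitter_archive-shaped frame."""
    out = df.copy()

    # Keep only original tweets: drop retweets and replies.
    out = out[out["retweeted_status_id"].isna()
              & out["in_reply_to_status_id"].isna()]

    # The reply/retweet columns are now all-null and can be dropped.
    out = out.drop(columns=[
        "in_reply_to_status_id", "in_reply_to_user_id",
        "retweeted_status_id", "retweeted_status_user_id",
        "retweeted_status_timestamp",
    ])

    # Parse timestamps into datetime format.
    out["timestamp"] = pd.to_datetime(out["timestamp"])

    # Strip the HTML anchor around the source and make it categorical.
    out["source"] = (out["source"]
                     .str.extract(r">([^<]+)<", expand=False)
                     .astype("category"))

    # "None" strings -> real missing values, then merge the stage columns.
    stages = ["doggo", "floofer", "pupper", "puppo"]
    out[stages] = out[stages].replace("None", np.nan)
    out["dog_stages"] = out[stages].apply(
        lambda row: ",".join(row.dropna()) or np.nan, axis=1)
    out = out.drop(columns=stages)

    # Assumed heuristic: lowercase "names" are articles and other non-names.
    out["name"] = out["name"].replace("None", np.nan)
    out.loc[out["name"].str.islower().fillna(False), "name"] = np.nan

    # One float rating instead of two integer columns.
    out["rating"] = out["rating_numerator"] / out["rating_denominator"]
    return out
```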
Join the twitter_archive dataframe with the image_predictions and tweet_jsons dataframes on the tweet_id/id columns, removing the rows whose tweet IDs are not present in all three dataframes.
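The join step above can be sketched with inner merges, which keep only the tweet IDs present in all three dataframes:

```python
import pandas as pd


def combine_frames(twitter_archive: pd.DataFrame,
                   image_predictions: pd.DataFrame,
                   tweet_jsons: pd.DataFrame) -> pd.DataFrame:
    """Inner-join the three frames on tweet_id (named id in tweet_jsons)."""
    merged = twitter_archive.merge(image_predictions, on="tweet_id", how="inner")
    merged = merged.merge(tweet_jsons, left_on="tweet_id", right_on="id",
                          how="inner")
    # The id column duplicates tweet_id after the join, so drop it.
    return merged.drop(columns="id")
```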
The code and test results are available in a separate Jupyter Notebook. Some steps required re-assessing the data after cleaning (e.g. the dog stages).
Data Analysis is available in a separate report file.