I downloaded `twitter_archive_enhanced.csv`, uploaded it to the Project Workspace on Udacity, and read it into the `twitter_archive` dataframe with `pandas`.

I wrote code to download `image_predictions.tsv` directly to the Project Workspace and read it into the `image_predictions` dataframe with `pandas`.

I wrote a script to get Twitter JSON data via the API with the `Tweepy` library, using the list of tweet IDs from the `twitter_archive` dataframe, and saved it to the `tweet_json.txt` file. I uploaded this file to the Project Workspace and added the script's code to the project notebook without the authentication keys. Since it would cause errors if left that way, I commented out the cell that contains the code. I read the data from `tweet_json.txt` into the `tweet_jsons` dataframe, using the `json` and `pandas` libraries.
There are 17 variables in the `twitter_archive` dataframe; the first 10 are from the original Twitter data, and 7 were added later, based mostly on the content of the tweets. For example:

- `expanded_urls` - a list of expanded URLs from Twitter entities or extended entities.
- `puppo` - one of the dog stages (middle), extracted from the tweet's text ("None", if missing).

The descriptions of the Twitter data used in the list come from this source.
The assessment revealed the following issues:

- Some rows have non-null `retweeted_status_id` and `retweeted_status_user_id` values, which means that these tweets are actually retweets, and this doesn't follow the project guidelines.
- The `floofer` and `name` columns are encoded with "None" strings, not `pandas` `NaN` values.
- The `timestamp` and `retweeted_status_timestamp` columns are not in datetime format.
- The `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` columns are in float format and scientific notation; but once retweets and replies are removed, these columns need no additional action, since together with `retweeted_status_timestamp` they will contain only null values and can be dropped.
- The `name` column contains articles and other "non-name" words.
- The values in the `source` column are links with HTML wrapped around the actual content, which doesn't improve readability. Also, the type of the column should be category.
- Missing values in the `expanded_urls` column.
- `pupper`, `puppo` and `doggo` may be combined into one column as levels of a single categorical variable. Still, dual values may occur when there is more than one dog in a picture.
- `rating_numerator` and `rating_denominator` should be used to calculate one rating value in float format to be used in the analysis.

The `image_predictions` dataframe contains the top three image predictions about dog breed made by a neural network, based on the images in the tweets. There are 12 variables in the dataframe.
In some cases it may also be reasonable to combine the predictions into three columns:

Number Of Prediction | Dog Breed | Confidence

but for the purpose of this project, where such a change would lead to many rows for the same tweet, it seems unreasonable.
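For reference, the reshape that was considered (but not applied) could be sketched as follows. This is a minimal sketch assuming the prediction columns are named `p1`/`p1_conf` through `p3`/`p3_conf`, which is not stated above:

```python
import pandas as pd


def predictions_long(image_predictions):
    """Stack the three wide prediction columns into one long table with
    tweet_id | prediction_number | dog_breed | confidence rows."""
    frames = []
    for i in (1, 2, 3):
        frame = image_predictions[["tweet_id", f"p{i}", f"p{i}_conf"]].rename(
            columns={f"p{i}": "dog_breed", f"p{i}_conf": "confidence"}
        )
        frame["prediction_number"] = i
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Each tweet then appears three times (once per prediction), which is exactly the row duplication the paragraph above decides against.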
Though the full tweets' data are accessible in the text file, for the purpose of the project only the following variables of the `tweet_jsons` dataframe will be used:
The following steps need to be taken to clean and combine the data for further analysis.

- Remove the rows of the `twitter_archive` dataframe that correspond to retweets and replies.
- Drop the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id` and `retweeted_status_user_id` columns.
- Convert the `timestamp` column to datetime format.
- Extract the actual content from the `source` column and convert it to category type.
- Clean the `expanded_urls` column.
- Replace "None" values in `dog_stages` and `name` with `pandas` `NaN` values.
- Fix the incorrect values in the `name` column and add the proper names, if any.
- Replace the remaining "non-name" words in the `name` column with `NaN` values.
- Combine the `pupper`, `puppo` and `doggo` columns in one `dog_stages` column.
- Check the combined `dog_stages` values for correctness.
- Combine the cleaned `rating_numerator` and `rating_denominator` columns in one `rating` column in float format.
- Join the `twitter_archive` dataframe with `image_predictions` and `tweet_jsons` on the `tweet_id`/`id` columns, removing the rows whose tweet IDs are not present in all three dataframes.
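Most of the archive-cleaning steps above can be sketched in `pandas`. This is a minimal sketch under the assumption that the columns carry the names used above; the manual name fixes, the `expanded_urls` handling, and the `dog_stages` re-check are left out:

```python
import numpy as np
import pandas as pd


def clean_twitter_archive(twitter_archive):
    """Apply the mechanical cleaning steps to a copy of the archive."""
    df = twitter_archive.copy()

    # Keep original tweets only: drop retweets and replies.
    df = df[df["retweeted_status_id"].isna() & df["in_reply_to_status_id"].isna()]

    # Drop the reply/retweet bookkeeping columns, which are now all-null.
    df = df.drop(columns=["in_reply_to_status_id", "in_reply_to_user_id",
                          "retweeted_status_id", "retweeted_status_user_id",
                          "retweeted_status_timestamp"])

    # Convert the timestamp to datetime.
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Strip the HTML anchor tag around the source and make it categorical.
    df["source"] = df["source"].str.extract(r">(.*)<", expand=False).astype("category")

    # Replace "None" strings (name, dog stages) with NaN values.
    df = df.replace("None", np.nan)

    # Combine the stage columns into one dog_stages column
    # (dual stages end up comma-separated and need re-checking).
    stages = df[["doggo", "floofer", "pupper", "puppo"]].apply(
        lambda row: ",".join(row.dropna()), axis=1)
    df["dog_stages"] = stages.replace("", np.nan)
    df = df.drop(columns=["doggo", "floofer", "pupper", "puppo"])

    # One float rating from numerator and denominator.
    df["rating"] = df["rating_numerator"] / df["rating_denominator"]

    return df
```

The final join could then be two inner `merge` calls on `tweet_id`/`id`, which keeps only the IDs present in all three dataframes.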
The code and test results are accessible in a separate Jupyter Notebook. Some steps required re-assessing the data after cleaning (e.g. the dog stages).
The data analysis is available in a separate report file.