I’m a big fan of use functional programming where appropriate, particularly with data transformations. This notebook serves as a way to guide you, the reader, through the process of the data, but it doesn’t serve as the primary structure for this project. As I am writing and working through the code, I will also be refactoring this into separate .py files for later usage. As a result, you’ll see that much of the code is organized into functions, even though it takes fewer lines to just write it in each cell.
Download Data
Our dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego, hosted on Mcauley’s website. They have a handfule of datasets that were collected from Australian users, including reviews, user/item playtime, item metadata, and bundle data. We’re most interested in the V1 User-Items dataset, which describes the number of hours that users played on each of their games.
The file comes as a .json.gzip file, so we’ll need to extract the data before we can start processing it. More on that in the next section.
For now, we can download the raw, compressed data and save it to a local file. Let’s start by defining our input URL and output paths/directories.
from pathlib import Pathroot_dir = Path.cwd().resolve().parentraw_data_dir = root_dir /'data'/'raw'raw_data_dir.mkdir(parents=True, exist_ok=True)GZIP_PATH = raw_data_dir /'ucsd_playtime.json.gz'URL ='https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'print(f"Source URL: '{URL}'")print(f"Outputting data to file: '{GZIP_PATH.relative_to(root_dir)}'")
Source URL: 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'
Outputting data to file: 'data/raw/ucsd_playtime.json.gz'
Now we can make simple use of the requests library to download the data and write the raw bytes to the output file.
Let’s take a quick peek at the data, just the first 3 lines. For display purposes, I’m going to use the textwrap.shorten() function from the Python Standard Library so that the display looks cleaner.
import textwrapwith JSON_PATH.open('r', encoding='UTF-8') asfile: n_lines =sum(1for _ infile)file.seek(0)print(f'Total lines in the file: {n_lines:,}')print('First 3 lines:')print()for _ inrange(3):print(textwrap.shorten(file.readline(), width=80))
Total lines in the file: 88,310
First 3 lines:
{'user_id': '76561197970982479', 'items_count': 277, 'steam_id': [...]
{'user_id': 'js41637', 'items_count': 888, 'steam_id': [...]
{'user_id': 'evcentric', 'items_count': 137, 'steam_id': [...]
If we took a naive approach and tried calling pd.read_json() or json.load() on this file, we’d run into an error. It may seem strange that the data is stored this way, particularly how it uses a single-quote ' instead of the JSON-required double-quote ". However, we must consider the source of this data: The dataset was likely obtained through a REST API, which is unlikely to return all the data at once (certainly not 88,310 records).
Most likely, the authors McAuley & Kang ran a script that repeatedly requested data from the API for all possible users. To save progress along the way, they likely appended each record to the JSON file after it was retrieved, preventing any loss of data if there was an error along the way.
While this helps on the original data collection side with McAuley/Kang, it creates a couple problems for us. If we want to get this data into a Pandas DataFrame, we’ll need to handle 2 sides:
Since this is a JSON-by-Line format, each record must be handled individually, splitting records by a new line separator (either '\n' or '\r\n').
The data is stored as plain-text, so reading a single line will return a string that must be converted into an appropriate Python data structure (in this case a dictionary)
A module that is helpful here is Python’s abstract syntax grammar module, ast. This module has a function ast.literal_eval() that will convert a string into the appropriate Python object. Even better, it also handles nested data well, which we’ll find useful after converting.
Let’s try evaluating a single object to see what this looks like. From some exploration on the side, I found line 51 to be a good example.
This nested structure will need to be handled in a later notebook. Fortunately, we can still keep the lists of dictionaries when we convert it to a DataFrame, since Pandas allows for nested/complex data types.
For now, let’s process the data without cleaning/expanding any of the data. Since this may take a little while, I’ll add a progress bar using the tqdm library.
import pandas as pdfrom tqdm import tqdmdef parse_json_by_line(verbose: bool) -> pd.DataFrame:with JSON_PATH.open('r', encoding='utf-8') asfile:if verbose: total =sum(1for _ infile)file.seek(0)file= tqdm(file, 'Parsing JSON-by-Line', total, colour='green') records = []for line infile: line = line.strip()ifnot line:continue records.append(ast.literal_eval(line)) df = pd.DataFrame.from_records(records)return dfdf = parse_json_by_line(verbose=True)df.info()df.head()
Processing this data took me a little over 1 minute. If we wanted to speed this up, we could look at running this in parallel. However, 1 minute isn’t horrible, and this is really a one-time process, so we can leave that out.
Save to File
Now that we have some data to start cleaning, we should save it so that we don’t have to run this process again. Typically, I’d look for a format like feather for quick reading/writing. However, this can cause problems since we have the items column as a nested list. For simplicity, we can use a pickle format to serialize our data, which has the least amount of requirements.
Wrote 488.93 MB to file: 'data/raw/ucsd_playtime.pkl'
It’s unfortunate that our file size is so massive when using pickle, but this give us much better flexibility to stop here and resume cleaning in the next notebook.