webamp/packages/skin-database/docs/database.md
2022-06-19 16:20:57 -07:00

10 KiB

Database

The Winamp Skin Museum, and related projects (Twitter bot, Discord bot), are powered by an SQLite database which is not included in this repository, but a copy can be obtained by reaching out directly.

For reference, the tables/colums are as follows:

algolia_field_updates

The search feature on skins.webamp.org is powered by Algolia third-party search. Data about each skin is pushed up to Algolia as it changes. In order to make re-indexing (pushing) efficient, in the case that we want to invalidate some values, here we track each time we update a field in Algolia.

  • id A unique ID for this update
  • update_timestamp The time at which this field was pushed to Algolia
  • field The field in the search index that was updated
  • skin_md5 The md5 hash of the skin (see the skins table)

instagram_posts

For a while we were posting skins to Instagram trying to replicate the success of the Twitter bot. It didn't catch on, so we stopped. However, we have an index of all the posts that were made.

  • id A unique ID for this post (local to this database)
  • skin_md5 The md5 hash of the skin that was posted (see the skins table)
  • post_id The Instagram ID of this post
  • url The URL of this Instagram post

screenshot_updates

The Winamp Skin Museum's main feature is uniform screenshots of each classic Winamp skin. These screenshots are generated using a script that Puppeteer and webamp.org. If we make changes to that script, or fix bugs in Webamp, we may want to retake these screenshots and reupload them to S3. Similarly, we may want to track skins which fail to render or otherwise work in the screenshot script. This table records each time we retake the screenshots so that we can inteligently decide if we want to retake the screenshot at any given time.

  • id A unique ID for this update
  • update_timestamp When the screenshot was taken
  • skin_md5 The md5 hash of the skin that was updated (see the skins table)
  • success (bool) Was this screenshot/update successful?
  • error_message (string) Error message encountered when taking the sceenshot

archive_files

Each skin is actually a zip file. As part of the database, we examine the contents of each zip archvive and record metadata about the inner files. Each row in this table represents a file found within a Winamp skin's zip archive.

  • id A unique ID for this file
  • skin_md5 The md5 hash of the skin that this file was found within (see the skins table)
  • file_name The file path of this file within the archive
  • file_md5 The md5 hash of the file after being extracted from the zip
  • file_date The date that the file was created (according to the zip metadata)
  • uncompressed_size The size of the file, afterbeing decompressed Deprecated, see the file_info table.
  • text_content If the file is a text file, this column contains that text Deprecated, see the file_info table.
  • is_directory (bool) Is this file a directory?

key_value

For some cron jobs and async tasks we want to track arbitrary state and efficiency is not a concern. For these quick things we have a key/value store where the value is often a json blob, and we just overwrite the whole thing each time.

key - I'll give you one guess... value - I think you can see where I'm going here

skin_reviews

Before a skin can be tweeted by the Twitter bot, it must firt be approved by someone in the Discord server. This helps keep the quality bar somewhat high (fewer low effort skins getting shared) and also helps us avoid sharing NSFW skins on Twitter.

Additionally, skins marked as NSFW are blurred and down-ranked on the Skin Musuem.

These are all human reviews. Note that the bot will create both a NSFW and REJECTED review when a skin is marked as NSFW.

  • id A unique ID for this review
  • skin_md5 The md5 hash of the skin being reviewed (see the skins table)
  • review One of REJECTED, APPROVED, NSFW
  • reviewer (Often missing) the user who did the review

file_info

Metadata about files extracted from zip files. Because many skins contain the same files, or some files are duplicated in many skins, we normalize them here so that we only have one row for each file, no matter how many skins it appears within.

  • id A unique ID for this file
  • file_md5 The md5 hash of the file after being extracted from the zip (see the archive_files table)
  • file_date The date that the file was created (according to the zip metadata)
  • size_in_bytes The size of the file, afterbeing decompressed
  • text_content If the file is a text file, this column contains that text

skin_uploads

When a user attempts to upload a file at the Museum, they first request an upload URL so that they can upload the file directly to S3. Then, once they have uploaded it, they notify our server, and we kick off a job to download the skin from S3 and screenshot/scrape it as we have capacity. This approach lets us scale to an unlimited number of uploads during a spike in traffic, and we will process them at our leisure.

The status of an in-progress upload is tracked in this table, and it is used by the server as a task queue to track skins that need to be processed.

First a user requests a URL for a skin (based on its md5) (URL_REQUESTED) once they've uploaded to S3, they notify us (UPLOAD_REPORTED), finally the file is processed and it ends up as either ERRORED or ARCHIVED (success!)

  • id A unique ID for this upload attempt. This is used in the S3 filename that the user is given permission to create.
  • skin_md5 The md5 of this uploaded file. Note that not all skins here will end up in the skins table. Either due to uploads not completing, or processing error, or the file is not actually a skin.
  • status Where in the pipeline is this upload: ERRORED, UPLOAD_REPORTED, URL_REQUESTED, ARCHIVED.
  • filename The filename that the user had for the file when they uploaded it (files on S3 file name is based on id, so we need the filename here.

files

Information about skin files that we have injested. Since skins are indexed by their md5 content, we may have encounted the same skin file under multiple filenames. This table shows all the filenames we've encountered for each skin, and (in some cases) where we got the file.

  • id A unique ID for this file
  • file_path The file path (directory and name) of the skin file
  • skin_md5 The md5 hash of the file (see the skins table)
  • source_attribution Where we found this file (if known)

skins

Information about a given Winamp skin. Each item in the Winamp Skin Museum corresponds to a row in this table with a skin_type of 1 (classic).

  • id A unique ID for the skin (not really used, though it should be. Instead every other table references skins by md5)
  • md5 The md5 hash of the skin file. Most other tables reference skins by this value. Usually with a column named skin_md5 We should probably fix that and use id
  • skin_type One of 1 (classic), 2 (modern), 3 (pack), 4 (invalid)
  • emails A space-separated list of emails extracted from the skin's text files
  • readme_text Using a herusitic, we identify files in the skin archive that are likely to be readme or readme-like files and index them here. This should probably be done dynamically at query time and we should find this value in the file_info table.

ia_items

Each skin in the Museum should be persisted to the Internet Archive for preservation. This is done daily and we keep a local cache of what information the Internet Archive has about each skin.

  • id A unique ID for this internet archive item (local to this database)
  • skin_md5 The md5 hash of the skin (see the skins table)
  • identifier The unique identifier used by the archive for this item (this is used in the URL of the item and for API queries)
  • metadata A JSON blob contanining the Internet Archive's metdata about the item
  • metadata_timestamp The last time we scraped the metadata from their API (I think)

refreshes

Much of the data in the database is derived from the skin archives themselves. However, we don't have all the skin files locally, they are in S3. So, we only periodically download them and re-extract all the data/screenshots. This table records each time we do this. This way we can know which skins need to be refreshed.

  • id A unique ID for this refresh
  • skin_md5 The md5 hash of the skin (see the skins table)
  • error Any error we encountered during the refresh
  • timestamp When we performed the refresh

tweets

The Twitter bot @winampskins tweets. Likes and retweets are scraped nightly, but can only read the most recent tweets. Like/retweets on older tweets are not seen by us, so the numbers represent a lower bound. Additionally, the Twitter API only lets you go back so far, so we may be missing some Tweets, since we didn't index these from the very begininig.

Finally, not all tweets that the bot tweets are scritly skins. Some are manual tweets or retweets. So, not every tweet will have a skin_md5.

  • id A unique ID for this tweet (local to this database)
  • likes The number of likes the tweet got
  • retweets The number of retweets the tweet got
  • skin_md5 The md5 hash of the skin that was tweeted (see the skins table) Note Not all tweets reference a skin
  • tweet_id The ID for this tweet, as assigned by Twitter. This can be used to construct the tweet URL

knex_migrations

Metadata about migrations that have been run on the database. Used for making database changes

knex_migrations_lock

Used to ensure migrations are applied correctly.

museum_sort_overrides

Used for making editorial decisions about how individual skins show up in the main scroll of the Winamp Skin Museum. Used for boosting the default skins and hiding aparent duplicates. In reality there are many many near or actual dupes, but we manually cull duplicates that appear in the first few pages.

  • id A unique ID for this override (local to this database)
  • skin_md5 The md5 hash of the skin that is being overridden
  • score A score for how highly rated this skin should be. Negative numbers mean the skin should be hidden.
  • comment Explains why the skin was ranked this way