What is this website?

I created this website to make it easier to find Japanese content (like dorama, anime, etc.) based on its difficulty. I was frustrated because I felt the list of content on other sites, like jpdb, was too short. So, I decided to build something similar myself, but with a much bigger database of media.

What functions does this website have?

It's a big, searchable list of Japanese content sorted by difficulty. You can use different filters to find what you want.

My main goal was searching by difficulty, so you won't find things like Anki integration or spaced repetition (SRS) features. I'm focused on the difficulty search for now, but I may add those features in the future.

What kind of media is on here?

It has all kinds of Japanese media. It starts with the obvious stuff like anime TV series and anime movies, but I've also included Japanese documentaries, historical movies, random dorama (live-action dramas), and even some YouTube content. I plan to keep adding more.

How did you create this database?

I wrote a script that automatically analyzed the entire database of subtitles from the kitsunekko-mirror GitHub repository. After it finished running, I ended up with over 9,000 different entries for all kinds of media, which is what you see on the site.

How accurate is the analysis?

It's pretty accurate; my own rough estimate is about 99%. It took me a long time to get it right, and I went through more than five versions of the script.

The main problem was that every single subtitle file on kitsunekko-mirror uses a different style or format. I first had to write code to clean and format all of them before I could even start analyzing the words.
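To give a feel for what that cleaning step involves, here is a minimal sketch. It is not the actual script: the function name clean_subtitle_text and the regular expressions are made up for illustration, and it only assumes the two most common layouts (SRT cue blocks and ASS "Dialogue:" rows).

```python
import re

# Hypothetical helper, not the site's real code: reduce an SRT or ASS
# subtitle file to a list of plain dialogue strings.
TIMING_LINE = re.compile(r"^\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->")
ASS_OVERRIDE = re.compile(r"\{[^}]*\}")        # ASS styling, e.g. {\an8}
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")    # SRT styling, e.g. <i>...</i>


def clean_subtitle_text(raw: str) -> list[str]:
    rows = [r.strip() for r in raw.lstrip("\ufeff").splitlines()]
    if any(r.startswith("Dialogue:") for r in rows):
        # ASS/SSA: the spoken text is everything after the 9th comma.
        texts = [r.split(",", 9)[9] for r in rows
                 if r.startswith("Dialogue:") and r.count(",") >= 9]
    else:
        # SRT-like: drop blank rows, cue counters, and timing rows.
        # Caveat: a dialogue row made only of digits is dropped too.
        texts = [r for r in rows
                 if r and not r.isdigit() and not TIMING_LINE.match(r)]

    cleaned = []
    for text in texts:
        text = ASS_OVERRIDE.sub("", text)
        text = HTML_TAG.sub("", text)
        # ASS marks line breaks inside a cue with a literal \N or \n.
        text = text.replace("\\N", " ").replace("\\n", " ").strip()
        if text:
            cleaned.append(text)
    return cleaned
```

Real files are messier than this (mixed encodings, karaoke effects, credits, and so on), which is why the cleanup took several rewrites.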

If you're interested, you can find all the original scripts I used in this project's GitHub repository. Feel free to check them out or improve them.

How does the script work (in more detail)?

The script analyzes every word in the subtitle files for a show. First, it's smart enough to filter out all the "junk" words that don't count as real vocabulary, like:

  • Proper names (like "Tanaka" or "Tokyo")
  • Sound effects (like ざっ or ミーミー)
  • Interjections (like あっ or ええ)
  • Numbers
  • Foreign words

After it filters all that out, it's left with a clean list of "real" vocabulary, like nouns, verbs, and adjectives.
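The filtering described above can be sketched roughly like this. Everything here is an assumption made for illustration: the Token class is hypothetical, the part-of-speech labels follow the IPADIC convention used by MeCab-style analyzers (the tokenizer behind the site isn't named here), "foreign words" is read as anything containing Latin letters, and the sound-effect check is a crude pattern match.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token, as a MeCab-style analyzer might produce it."""
    lemma: str       # dictionary form of the word
    pos: str         # coarse part of speech, e.g. "名詞" (noun)
    pos_detail: str  # finer category, e.g. "固有名詞" (proper noun)


CONTENT_POS = {"名詞", "動詞", "形容詞"}   # nouns, verbs, adjectives
JUNK_DETAIL = {"固有名詞", "数"}            # proper nouns, numbers
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9Ａ-Ｚａ-ｚ０-９]")
REPEATED_KANA = re.compile(r"^([ぁ-んァ-ヶー]{1,3})\1+$")  # e.g. ミーミー


def is_real_vocabulary(tok: Token) -> bool:
    if tok.pos not in CONTENT_POS:
        return False  # drops interjections (感動詞), particles, symbols
    if tok.pos_detail in JUNK_DETAIL:
        return False  # drops proper names and numbers
    if LATIN_OR_DIGIT.search(tok.lemma):
        return False  # assumed meaning of "foreign words"
    # Crude sound-effect heuristic: ends in a small tsu, or is a short
    # kana chunk repeated. It also catches ordinary words like いろいろ.
    if tok.lemma.endswith(("っ", "ッ")) or REPEATED_KANA.match(tok.lemma):
        return False
    return True


tokens = [
    Token("田中", "名詞", "固有名詞"),
    Token("食べる", "動詞", "自立"),
    Token("あっ", "感動詞", "*"),
    Token("ミーミー", "名詞", "一般"),
    Token("三", "名詞", "数"),
    Token("静か", "名詞", "形容動詞語幹"),
]
kept = [t.lemma for t in tokens if is_real_vocabulary(t)]
```

With the sample tokens above, only 食べる and 静か survive the filter.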

For every "real" word, it uses a frequency library (wordfreq) to check how common or rare it is in the general Japanese language. It then gives each word a "rarity score."
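As an illustration only, a rarity score could be derived from wordfreq like this. The zipf_frequency function is a real part of wordfreq, but the formula (8.0 minus the Zipf value), the names rarity_score, wordfreq_lookup, and stub_lookup, and the numbers in STUB_ZIPF are assumptions or placeholders made up for this sketch; the actual script may compute its score differently.

```python
from typing import Callable

ZIPF_MAX = 8.0  # assumption: rough top of wordfreq's Zipf scale


def rarity_score(word: str, zipf_lookup: Callable[[str], float]) -> float:
    """Higher means rarer. Assumed formula: ZIPF_MAX minus Zipf frequency."""
    return round(ZIPF_MAX - zipf_lookup(word), 2)


def wordfreq_lookup(word: str) -> float:
    # Real lookup. Japanese support in wordfreq needs its optional
    # MeCab dependencies, so the import is kept inside the function.
    from wordfreq import zipf_frequency
    return zipf_frequency(word, "ja")


# Placeholder numbers so the sketch runs without wordfreq installed.
# These are NOT real wordfreq values.
STUB_ZIPF = {"食べる": 5.2, "憂鬱": 3.1}


def stub_lookup(word: str) -> float:
    # wordfreq returns 0.0 for words it doesn't know; the stub does too.
    return STUB_ZIPF.get(word, 0.0)


common = rarity_score("食べる", stub_lookup)
rare = rarity_score("憂鬱", stub_lookup)
unknown = rarity_score("未知語", stub_lookup)
```

To use the real library, pass wordfreq_lookup instead of stub_lookup. Under this assumed formula, a word wordfreq has never seen gets the maximum score of 8.0.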

The "Vocab Density %" is the main number I recommend using. It's the percentage of words whose rarity score passes a set threshold (in the script, this is DIFFICULTY_THRESHOLD = 5.0).

In simple terms: this percentage tells you how many "difficult" or "rare" words you can expect to find in the show. A low percentage (like 5%) means almost all the words are common and the show is easy. A high percentage (like 20%) means the show uses a lot of rare vocabulary and is much more difficult.
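In code, the idea looks roughly like the sketch below. DIFFICULTY_THRESHOLD = 5.0 is the value from the script; the function name vocab_density is hypothetical, and two details are assumptions: that "passing" the threshold means a score greater than or equal to it, and that the percentage is taken over all the "real" words that were scored.

```python
DIFFICULTY_THRESHOLD = 5.0  # value quoted from the script


def vocab_density(rarity_scores: list[float],
                  threshold: float = DIFFICULTY_THRESHOLD) -> float:
    """Percentage of scored words whose rarity is at or above the threshold."""
    if not rarity_scores:
        return 0.0  # nothing to measure
    rare = sum(1 for score in rarity_scores if score >= threshold)
    return 100.0 * rare / len(rarity_scores)


# 1 rare word out of 20 gives 5%, the "easy" example mentioned above.
easy_show = vocab_density([1.0] * 19 + [6.0])

# 4 rare words out of 20 gives 20%, the "difficult" example.
hard_show = vocab_density([1.0] * 16 + [6.0] * 4)
```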

How do you recommend I use this website?

To understand how hard a show is, I really recommend you look at the "Vocab Density %" number. I believe this is the most accurate way to judge difficulty.

You can, of course, look at other things like "Kanji Difficulty" or "Vocab Difficulty (1-100)," but the "Vocab Density %" is the most direct and accurate measurement of how many rare words you'll run into.

What features are you planning to add in the future?

I would really love to add Anki integration in the future. The idea would be that you could one-click-add all the words from a show right into your Anki deck.

I don't think it would be that hard to do, but it would require me to completely re-process my entire database and change how the data is structured. So, it's something I'll definitely consider doing, but it's a future plan.