Once Upon A.I.: Building a Better Recommender, Part Two — Single Word Queries

In my last blog post, I detailed my project urStory, a recommendation engine for public domain short stories built on the natural language processing library spaCy. Although I was able to demo the project with a Flask front end by the project deadline, several areas still needed improvement. As it stands, the project is mainly a proof of concept, something to improve on as a passion project.

urStory lets users explore a data set of 20,000 stories by turning their queries into word vectors. For example, “Aladdin has three wishes granted by a genie” returns the stories whose vectors are most similar to the query vector, as measured by cosine similarity. Because the query function was thought up and added towards the end of the project, there were a few oversights. The biggest is that querying for just “Aladdin” wouldn’t return Aladdin stories but, instead, bibliographic material and lists of names! The word vector for the query must have been most similar to other collections of isolated names.

The query function would mostly return lists of names
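
To make the ranking step concrete, here is a minimal sketch of a cosine similarity query in plain Python. It is not urStory's actual code: the three-dimensional vectors, the story titles, and the helper names `cosine_similarity` and `rank_stories` are all made up for illustration, whereas the real project works with spaCy vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

def rank_stories(query_vector, story_vectors, top_n=3):
    """Return the top_n (title, score) pairs, most similar first."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in story_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy data (hypothetical): real story vectors would have hundreds of dimensions.
story_vectors = {
    "Aladdin and the Wonderful Lamp": [0.9, 0.1, 0.2],
    "The Fisherman and the Genie": [0.8, 0.3, 0.1],
    "Index of Authors": [0.1, 0.9, 0.4],
}
query_vector = [0.85, 0.15, 0.2]

top = rank_stories(query_vector, story_vectors, top_n=2)
for title, score in top:
    print(f"{score:.3f}  {title}")
```

With a full-sentence query the vector carries enough context to land near the right stories. A single name gives the model far less to work with, which is consistent with the name lists I was seeing.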

To solve this problem, I tried two approaches. The first was to turn each title into a vector and compute the cosine similarity between the single word query and each vectorized title. The second was to check whether the word appeared in any of the story titles and, if so, to return the stories most similar to the first match.

Title Vectors

I tried this first to see how precise the title vectors could be. One would hope that a vector for “Aladdin” would be most similar to titles with “Aladdin” in them, or at least to titles containing other names.

The Vector query did not return the Aladdin story.

Unfortunately, this was not the case. “Aladdin” was not in the results, and it was difficult to understand how the model chose these stories.

Boolean Results

The Boolean approach gave correct results. Here, a single word query first checks whether the word is present in any of the titles. If it is, the function takes the first matching story and returns the stories most similar to it.

The Boolean query function returned all of the Aladdin stories.
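
The Boolean lookup can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's real function: the `stories` list, its toy vectors, and the name `title_match_query` are hypothetical, and I am assuming a whole-word match on the title rather than a substring match.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def title_match_query(word, stories, top_n=3):
    """Find the first story whose title contains `word`, then return the
    titles of the stories most similar to that story's vector."""
    word = word.lower()
    # Whole-word, case-insensitive match (an assumption made for this sketch).
    matches = [s for s in stories if word in re.findall(r"\w+", s["title"].lower())]
    if not matches:
        return []  # nothing to anchor on; avoids indexing into an empty list
    seed_vector = matches[0]["vector"]
    ranked = sorted(
        stories,
        key=lambda s: cosine_similarity(seed_vector, s["vector"]),
        reverse=True,
    )
    return [s["title"] for s in ranked[:top_n]]

# Toy data (hypothetical titles and 3-dimensional vectors).
stories = [
    {"title": "Aladdin and the Wonderful Lamp", "vector": [0.9, 0.1, 0.2]},
    {"title": "The Fisherman and the Genie", "vector": [0.8, 0.3, 0.1]},
    {"title": "Index of Authors", "vector": [0.1, 0.9, 0.4]},
]

results = title_match_query("Aladdin", stories, top_n=2)
print(results)
```

Because the matched story is compared against itself, it always ranks first, and its nearest neighbors fill out the rest of the list.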

The Boolean query performed well, but left one issue: how to handle words outside of the corpus?

Words Outside of the Corpus

Words outside of the corpus would throw index errors whenever the single word query function ran. Because I had already turned the titles into vectors, I decided to compare the performance of the story vectors against the title vectors when querying a word outside of the corpus.
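
One way to avoid the error is to fall back to a vector comparison whenever no title contains the word. The sketch below assumes the failure comes from indexing into an empty list of matches, which is a guess about the cause rather than a confirmed diagnosis. The `embed` callable stands in for whatever turns a word into a vector (spaCy in the real project), and `toy_embed`, `single_word_query`, and the data are hypothetical.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def single_word_query(word, stories, embed, top_n=3):
    """Use the title match when there is one; otherwise compare the
    word's own vector against every story vector."""
    word = word.lower()
    matches = [s for s in stories if word in re.findall(r"\w+", s["title"].lower())]
    if matches:
        reference, method = matches[0]["vector"], "title"
    else:
        reference, method = embed(word), "vector"  # fallback, no IndexError
    ranked = sorted(
        stories,
        key=lambda s: cosine_similarity(reference, s["vector"]),
        reverse=True,
    )
    return method, [s["title"] for s in ranked[:top_n]]

# Toy data (hypothetical titles and 3-dimensional vectors).
stories = [
    {"title": "Aladdin and the Wonderful Lamp", "vector": [0.9, 0.1, 0.2]},
    {"title": "The Fisherman and the Genie", "vector": [0.8, 0.3, 0.1]},
    {"title": "Index of Authors", "vector": [0.1, 0.9, 0.4]},
]

def toy_embed(word):
    """Stand-in for a real word-vector lookup; unknown words get a zero vector."""
    return {"wish": [0.9, 0.2, 0.1]}.get(word, [0.0, 0.0, 0.0])

in_title = single_word_query("Aladdin", stories, toy_embed, top_n=2)
not_in_title = single_word_query("wish", stories, toy_embed, top_n=2)
print(in_title)
print(not_in_title)
```

The fallback here compares against story vectors; swapping in title vectors would only require pointing the `sorted` key at a different column.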

Although the two approaches differed in their results, the story vector approach was ultimately the better choice: it made the fewest mistakes and didn’t require adding another vector column to the data frame and to the website app.

Further Areas of Improvement

While this approach handles single word queries well, one area to investigate is queries that do not have verbs, meaning they are not complete sentences. For this, I would need to save my spaCy model differently than I did before, as the current one does not include its pos_tagger. I could also add labels for characters in the pipeline.

Next week I will be improving the website’s results function. Stay tuned!

Data Scientist and Writer, passionate about language