To complete my Data Science coursework at the Flatiron School, I built a recommendation engine for public domain short stories called urStory. I was inspired by the original aim of the Brothers Grimm's collection of fairy tales: performing linguistic and cultural research on the German vernacular. I wanted to emulate their work, as well as the work of countless narrative scholars, and build my own dataset and analytic framework for analyzing short stories with Data Science. Doing so meant scraping Gutenberg.org and cleaning 20,000 public domain short stories from over a thousand books.
Using the natural language processing library spaCy, the recommendation engine converts a user query into a vector, which is then compared to the vectors of the story corpus using cosine similarity. Word vectors follow the distributional idea that a word's meaning is built from the company it keeps; the document vector for each story is the average of the word vectors it contains. Four stories are then selected from each of three partitions of the top 36 most similar results, giving users a total of 12 stories to choose from. When a story is selected, its text is presented, followed by another 12 stories similar to the one just read, as well as the querying tool.
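A minimal sketch of that pipeline looks like the following. This uses NumPy in place of spaCy's internals, and the function names and tier handling are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def doc_vector(token_vectors):
    """Average a story's (or query's) word vectors into one document vector."""
    return np.mean(token_vectors, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(query_vec, story_vecs, rng=None):
    """Rank stories by similarity to the query, split the top 36 into three
    partitions of 12, and sample 4 from each for 12 total recommendations."""
    if rng is None:
        rng = np.random.default_rng()
    sims = [cosine_similarity(query_vec, v) for v in story_vecs]
    top36 = np.argsort(sims)[::-1][:36]          # indices, most similar first
    picks = []
    for tier in np.array_split(top36, 3):        # three partitions of 12
        picks.extend(rng.choice(tier, size=4, replace=False))
    return picks
```

Sampling from three similarity tiers, rather than simply returning the top 12, keeps some serendipity in the results while still staying anchored to the query.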
The initial feedback for the project was overwhelmingly positive. Users were delighted by the ability to find familiar short stories by typing in a plot synopsis, like "Aladdin gets three wishes granted by a genie," and by the engine's ability to unearth forgotten stories when they typed in a sentence about their day, like "I ate a scallion."
However, some of these initial user tests revealed areas in need of improvement. While most were primarily aesthetic, the largest shortcoming of the recommendation engine is that it is, for now, exclusively content-based: there is no way to incorporate the history of a user's inputs and preferences (or those of similar users) into the recommendations it provides.
Though maintaining serendipity and chance in the results is a major concern (most users, myself included, hate the echo chamber effect of most recommenders), incorporating user history should at least be pursued. In the next few months, I plan to create a relational database on the back end and some means for users to rate and/or like stories, as well as to create collections like a "to read" list and a "favorites" list, both of which could help improve the results of the query function. Down the road, I would also love to use the Twitter API or some kind of news API to create a "trending" list of recommendations based on the hot topics of the day.
While these are larger goals, there are also several minor areas to tackle in the coming weeks, the first being that the recommendation engine does not handle single-word queries. For example, a query for "Aladdin" does not return the story "Aladdin and the Wonderful Lamp" as expected, but bibliographic material: the lists of names that make up many of the books' back matter! One reason is that the vector for the input "Aladdin," having no verbs or other context, is understood only as a proper name, so the query vector ends up most similar to lists of names rather than to the story of "Aladdin."
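One possible fallback for single-word queries is a plain Boolean match against story titles. This is a hypothetical sketch of the kind of Boolean comparison mentioned below, not the engine's current behavior:

```python
def title_fallback(query, titles):
    """Return titles containing the query as a case-insensitive substring.
    A crude Boolean match, but for a one-word query like "Aladdin" it can
    beat averaging a single word vector, which carries no sentence context."""
    q = query.strip().lower()
    return [t for t in titles if q in t.lower()]
```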
In my next blog post, I will detail the process of improving the query function to handle non-sentence, single-word queries. I will compare the performance of similarity against vectorized titles with the results of a simple Boolean title match. Stay tuned!