I just finished Patricia Lockwood’s novel No One Is Talking About This, a genre-defying book that begins embedded in the social media experiences of an extremely online protagonist and ends with the protagonist’s sister giving birth to a child with Proteus syndrome, a one-in-a-billion chance. To diagnose this syndrome prenatally, the doctors must perform exome sequencing of her DNA. The doctors tell the family it will be like looking for “a single misspelling in a single word on a single page of a very long book.”

It reminded me of one of the uses…


If you saw my blog post last week, you’ll know that I’ve been completing LaylaAI’s PySpark Essentials for Data Scientists course on Udemy and working through PySpark’s feature selection documentation. This week, while finalizing the model for my project and reviewing my work, I needed to perform feature selection.

Unlike LaylaAI’s, my best model for classifying music genres was a RandomForestClassifier and not a OneVsRest. To the surprise of many Spark users, features selected by the ChiSqSelector are incompatible with decision tree classifiers, including random forest classifiers, unless you transform the sparse vectors to dense…


This past week I was doing my first machine learning project for LaylaAI’s PySpark Essentials for Data Scientists course on Udemy. In the first classification lessons, the models achieved stellar accuracy scores, but when I used the same data cleaning, normalization, and scaling for my project, the accuracy of these models was absolutely dismal.

Whenever I did machine learning projects with scikit-learn in Python, I would do the feature selection and polynomial transformations in a more hands-on manner, that is, whenever I wasn’t doing an NLP project. In some cases for Big Data projects, you might be working with 7,000…


The Audimeter, or how the Nielsen company first got its television ratings

The book that first got me interested in pursuing Data Science was Claude C. Hopkins’ Scientific Advertising. Hopkins pioneered the use of statistical testing and test campaigns in the field of advertising. While Hopkins focused on print advertising and supermarket mailers, many of his followers, like David Ogilvy, would expand his concepts to television and later to the internet. Tracking media consumption was essential to how publications and channels would value their advertising. …


“The Unicorn in Captivity”

Last fall, after my first trip to the Met Cloisters, I started learning more about the Unicorn Tapestries, a seven-tapestry series over 500 years old, with each tapestry comprising tens of thousands of threads. The seven tapestries tell the seemingly simple story of a unicorn’s capture, captivity, and murder and have entranced visitors for generations. Though a friend had mentioned that they were an allegory about the stations of the cross, much of the imagery found in them can be interpreted to contradict this, like the detail of the unknown botany (perhaps of extinct plant varieties or products…


How to Go with the MLFlow: Tracking Tutorial.

Anyone who has worked on a professional or personal machine learning project will know that keeping track of the performance and evaluations of each of your models can get particularly messy. In my case, I can somewhat fondly recall scrolling bleary-eyed through a finalized Jupyter notebook at the eleventh hour to ensure I had created a model from each relevant algorithm and tuned each one’s hyperparameters properly.

MLFlow is an open-source framework released in 2018 by Databricks, the company founded by the creators of the Apache Spark project, to help users keep track of the…


While learning about classification algorithms in PySpark’s MLlib, I came across an algorithm I had not used in scikit-learn, the one-vs-rest classifier.

The one-vs-rest classifier, or one-vs-all, splits a multi-class classification problem into several binary classification problems with a model for each class. For the popular Iris dataset, this would mean that each species of iris would have its own model. The classifier for class i is trained to predict whether the label is i or not, and the final prediction is the label of the most confident classifier. …
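To make the idea concrete, here is a minimal sketch of one-vs-rest in plain Python. It is not PySpark's OneVsRest API: the helper names (`train_binary`, `train_one_vs_rest`, `predict`), the toy points, and the class names "a", "b", and "c" are all made up for illustration, and a tiny hand-rolled logistic regression stands in for whatever base classifier you would actually use.

```python
import math

# Hypothetical toy data: three well-separated 2-D clusters (not a real dataset).
POINTS = [(0, 0), (1, 0), (0, 1),   # class "a"
          (5, 0), (6, 0), (5, 1),   # class "b"
          (0, 5), (1, 5), (0, 6)]   # class "c"
LABELS = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(points, targets, lr=0.1, epochs=2000):
    """Fit a small logistic regression (targets are 0 or 1) by batch gradient descent."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, t in zip(points, targets):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - t
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def train_one_vs_rest(points, labels):
    """Train one binary model per class: 'is it this class, or any other?'"""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else 0 for label in labels]
        models[cls] = train_binary(points, targets)
    return models


def predict(models, x):
    """Return the class whose binary model gives the highest raw score for x."""
    def score(cls):
        weights, bias = models[cls]
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(models, key=score)


models = train_one_vs_rest(POINTS, LABELS)
print(predict(models, (5.5, 0.5)))
```

With three classes, three binary models are trained, and the prediction is simply the class whose model is most confident, which mirrors the description above.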


If you’ve been following my blog posts, you’ll know that I’ve been pursuing a certification in Data Science with PySpark using LaylaAI’s course on Udemy, PySpark Essentials for Data Scientists (Big Data + Python). Currently, the course is covering how to create classification models using PySpark’s own machine learning library, MLlib.

As I’ve covered in the past, as difficult as adjusting to a MapReduce framework may be, PySpark has much in common with libraries like Pandas as well as with the syntax of the query language SQL. Similarly, if you’re familiar with classification algorithms from, you…


The Jack Cade Rebellion from a production of Henry VI at Shakespeare’s Globe (Production photo credit: Marc Brenner)

Five years ago, Oxford controversially republished the first three plays of the minor Henriad with Christopher Marlowe as the coauthor. The scholarship may befuddle fans of the Bard, unless they know about Natural Language Processing.

Shakespeare’s Henry VI plays have always occupied a curious place in his canon. Although they are not among the most read of his works, much of the interest in them comes from wanting to understand how the upstart crow with “a tiger’s heart wrapped in a player’s hide” developed his talent to become a pillar of world literature. The influence of the playwright Christopher Marlowe has long…


If you’ve been following my previous blog posts, you’ll know I’ve been making my way through Stephen Grider’s SQL and PostgreSQL: The Complete Developer’s Guide course on Udemy. For the past week, Grider’s course has described how to model certain features of the Instagram app in an SQL schema. Although many of these ideas were on the theoretical side, I really enjoyed this section and wanted to summarize some of my takeaways from it.

Likes

A user of Instagram might be surprised that the number of likes a post has is not stored as a…

Anton Haugen

Data Scientist and Writer, passionate about language
