The second entry in the DS interview question series comes from ryxcommar. Let's put pen to paper (i.e. keyboard to LaTeX) and solve it!
Because the world needed another blog.
This is a solution to an interview question posed by Quantian on Twitter. It is the first in a series of interview questions I plan to post.
Cosine similarity is a useful similarity measure for comparing NLP document vectors, but it should not be used with probability distributions.
I demonstrate how to use gztool and SQLite to provide near random access to large gzip files, and enable querying by fields of interest.