Insights on word2vec

My group has been talking a lot about the intuition behind word2vec recently, and we read a short paper that investigated some of the reasons it might work so well. There is a NIPS paper on this with a really nice analysis, and a more practically-focused follow-up.
The TL;DR from the author of both papers is that word2vec (via SGNS) is "doing something very similar to what the NLP community has been doing for about 20 years; it's just doing it really well".

The longer set of interesting points from the paper:
1) Both CBOW and skip-gram are "shallow" neural models, and the whole idea was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets. In the word2vec package (and Mikolov's paper), they recommend using the skip-gram model with negative sampling (SGNS), as it outperformed the other variants on analogy tasks.
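
As a quick illustration, here is a minimal sketch of training both variants, assuming gensim's Word2Vec implementation (gensim 4.x parameter names) and a toy stand-in corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a stand-in for a much larger dataset).
sentences = [
    ["canada", "is", "covered", "in", "snow"],
    ["the", "desert", "is", "hot", "and", "dry"],
    ["snow", "falls", "in", "canada", "every", "winter"],
]

# Skip-gram with negative sampling (SGNS): sg=1 selects skip-gram,
# negative=5 draws five negative samples per observed (word, context) pair.
sgns = Word2Vec(sentences, vector_size=50, window=5, sg=1, negative=5, min_count=1)

# CBOW with negative sampling: sg=0 predicts the target word from the
# averaged context vectors instead of predicting each context word separately.
cbow = Word2Vec(sentences, vector_size=50, window=5, sg=0, negative=5, min_count=1)

print(sgns.wv.most_similar("canada", topn=3))
```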

2) Omer does a thorough analysis of the SGNS word embedding algorithm and shows that it is implicitly factorizing a word-context matrix whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant (the log of the number of negative samples). PMI matrices are commonly used in the traditional "distributional semantics" approach to representing words. Intuitively, the decomposition of the positive PMI matrix (PPMI) works well because the perceived similarity of two words is more influenced by the positive contexts they share than by the negative contexts they share (e.g. “Canada” is like “snow”, versus “Canada” is not like “desert”).
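
A small sketch of the factorized matrix, using my own toy counts: each cell is PMI(w, c) shifted down by log k (k being the number of negative samples) and clipped at zero, giving the shifted PPMI.

```python
import numpy as np

def shifted_ppmi(counts, k=5):
    """Shifted positive PMI from a word-context co-occurrence count matrix.

    counts[i, j] is how often word i was observed with context j;
    k is the number of negative samples, giving the global shift log(k).
    """
    total = counts.sum()
    p_wc = counts / total                              # joint probability P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                       # cells with zero co-occurrence
    return np.maximum(pmi - np.log(k), 0.0)            # shift by log k, clip at zero

# Toy 3-word x 3-context count matrix (made-up numbers).
counts = np.array([[10.0, 0.0, 2.0],
                   [1.0,  8.0, 1.0],
                   [3.0,  1.0, 6.0]])
print(shifted_ppmi(counts, k=1))   # k=1 recovers plain PPMI
```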

3) They experimented with different association metrics based on the same idea (e.g., a shifted PPMI, SPPMI) and with alternative matrix factorization methods (e.g., SVD). They evaluated the word representations on four datasets covering word similarity and relational analogy tasks. For the similarity tasks, the word vectors are evaluated by ranking the word pairs according to their cosine similarities and measuring the correlation with the human ratings.
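
A minimal sketch of the similarity evaluation, with made-up vectors and human ratings standing in for a dataset like WordSim-353:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors and human similarity ratings (made-up numbers).
vectors = {
    "canada": np.array([0.9, 0.1, 0.3]),
    "snow":   np.array([0.8, 0.2, 0.4]),
    "desert": np.array([0.1, 0.9, 0.2]),
    "sand":   np.array([0.2, 0.8, 0.1]),
}
pairs = [("canada", "snow", 8.5), ("canada", "desert", 2.0), ("desert", "sand", 8.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

# Rank correlation between the model's ordering of the pairs and the human one.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```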

4) In most cases, SGNS does a comparable job to the other methods. While their SPPMI provides a far better solution to SGNS’s objective, it does not necessarily perform better than SGNS on linguistic tasks; this could be related to SGNS down-weighting rare words, which PMI-based methods are known to exaggerate. SGNS also performs weighted matrix factorization, which may be giving it an edge in the analogy task versus SVD.
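
A toy numeric illustration (all counts made up) of how raw PMI exaggerates rare events:

```python
import math

def pmi(count_wc, count_w, count_c, total):
    # PMI(w, c) = log[ P(w, c) / (P(w) * P(c)) ]
    #           = log[ count(w, c) * D / (count(w) * count(c)) ]
    return math.log(count_wc * total / (count_w * count_c))

D = 10_000_000  # total number of (word, context) pairs in the corpus

# A rare word that co-occurs with a rare context both of the two times it appears
# gets a huge PMI score...
print(pmi(count_wc=2, count_w=2, count_c=2, total=D))                 # ~15.4

# ...while a frequent, genuinely associated pair scores much lower.
print(pmi(count_wc=5_000, count_w=50_000, count_c=20_000, total=D))   # ~3.9
```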

5) A nice intuitive note: SGNS learns finer-grained vectors than CBOW, but only when one trains over more data. The main reason is that CBOW smooths over a lot of the distributional statistics by averaging over all the context words, while SGNS does not. With little data, this "regularizing" effect of CBOW turns out to be helpful, but since data is the ultimate regularizer, SGNS is able to extract more information when more data is available.
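
A rough sketch of that difference, in my own simplified notation (ignoring subsampling and dynamic window sizes):

```python
# For the fragment "snow falls in canada" with "in" as the target word:
context = ["snow", "falls", "canada"]
target = "in"

# CBOW forms ONE training example: predict the target from the averaged
# context representation, which smooths over the individual context words.
cbow_example = (context, target)

# Skip-gram forms one example PER context word, so every (word, context)
# co-occurrence contributes its own update and its own statistics.
sgns_examples = [(target, c) for c in context]

print(cbow_example)    # (['snow', 'falls', 'canada'], 'in')
print(sgns_examples)   # [('in', 'snow'), ('in', 'falls'), ('in', 'canada')]
```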