The longer set of interesting points from the paper:
1) Both CBOW and skip-gram are "shallow" neural models, and the whole idea was to show that you can get better word representations by trading model complexity for efficiency, i.e. the ability to learn from much larger datasets. In the word2vec package (and Mikolov's papers), the recommended variant is skip-gram with negative sampling (SGNS), as it outperformed the others on analogy tasks.
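To make the SGNS objective concrete, here is a minimal sketch (my own toy code, not from word2vec) of the per-pair loss it optimizes: push the dot product of an observed (word, context) pair up through a sigmoid, and push k randomly sampled "negative" contexts down.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_vec, c_vec, neg_vecs):
    """Negative log-likelihood for one (word, context) pair.

    SGNS maximizes sigma(w . c) for the observed pair and
    sigma(-w . c') for each of the k sampled negative contexts c'.
    """
    pos = np.log(sigmoid(w_vec @ c_vec))
    neg = sum(np.log(sigmoid(-w_vec @ n)) for n in neg_vecs)
    return -(pos + neg)

# Toy vectors; in practice these are trained with SGD over a corpus.
rng = np.random.default_rng(0)
dim, k = 50, 5
w = rng.normal(scale=0.1, size=dim)
c = rng.normal(scale=0.1, size=dim)
negs = rng.normal(scale=0.1, size=(k, dim))
loss = sgns_loss(w, c, negs)
```

The shallowness is the point: there is no hidden layer to backpropagate through, just dot products, so each training pair is cheap and the model can sweep billions of tokens.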
2) Omer (Levy) does a thorough analysis of the SGNS word embedding algorithm and shows that it is implicitly factorizing a word-context matrix whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant (the log of the number of negative samples). PMI matrices are a staple of the traditional, count-based approach to representing words, often called "distributional semantics". Intuitively, decomposing the positive PMI (PPMI) matrix works well because the perceived similarity of two words is influenced more by the positive contexts they share than by the negative contexts they share (e.g. "Canada" is like "snow" matters more than "Canada" is not like "desert").
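The shifted PPMI matrix from the analysis can be built directly from co-occurrence counts; a small sketch on a made-up 3x3 count table (the counts and the choice of k are illustrative only):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
counts = np.array([
    [10.0, 2.0, 0.0],
    [ 1.0, 8.0, 1.0],
    [ 0.0, 1.0, 9.0],
])
k = 2  # number of negative samples in SGNS; the shift is log(k)

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))  # -inf where counts are zero

# Shifted positive PMI: subtract log(k), clip negatives to zero.
sppmi = np.maximum(pmi - np.log(k), 0.0)
```

Zero-count cells get PMI of minus infinity, which the positive clipping maps to zero, so the resulting matrix stays sparse and finite.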
4) In most cases, SGNS does a comparable job to the other methods. While their SPPMI matrix achieves a far better value of SGNS's objective, it does not necessarily perform better than SGNS on linguistic tasks; this could be related to SGNS down-weighting rare words, whose importance PMI-based methods are known to exaggerate. SGNS also performs weighted matrix factorization, which may give it an edge over SVD on the analogy task.
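The SVD baseline being compared against amounts to a truncated factorization of the SPPMI matrix, with word vectors taken as the left singular vectors scaled by the square root of the singular values (the symmetric weighting Levy & Goldberg use). A sketch on a made-up 4x4 SPPMI matrix:

```python
import numpy as np

# A toy SPPMI matrix (word x context), as would be built from counts.
sppmi = np.array([
    [0.2, 0.0, 0.0, 1.1],
    [0.0, 0.9, 0.3, 0.0],
    [0.1, 0.0, 1.4, 0.0],
    [0.0, 0.7, 0.0, 0.5],
])

d = 2  # embedding dimension
U, S, Vt = np.linalg.svd(sppmi)

# Keep the top-d components; scale by sqrt(S) for symmetric weighting.
word_vecs = U[:, :d] * np.sqrt(S[:d])
recon = (U[:, :d] * S[:d]) @ Vt[:d, :]
```

SVD minimizes unweighted squared reconstruction error over all cells equally, whereas SGNS effectively weights cells by how often the pairs occur, which is one candidate explanation for the analogy-task gap.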