**Important note:** There are far too many papers for me to have accurately selected all of the interesting ones!
Every time I go through the list again, I find an additional set of papers for me to read.
This is dangerous as this is already a formidable list ;)

If you're interested in tackling the list of papers yourself, check out the ICLR 2017 Conference Track submissions. Bonus points if your over-eagerness to read all the papers crashes the web server again ;)

When I add (biased) it's because I'm an author on the paper or a colleague of one of the authors. I will note that I believe my bias to have a good foundation - my colleagues and I produce good work ;)

- Identity Matters in Deep Learning

The paper "put[s] the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress" and their modifications "improve significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks".

- On Large-batch Training for Deep Learning: Generalization Gap and Sharp Minima

The impact of mini-batch size is often overlooked. It's a reality we're just used to and rarely spare a thought for. This paper was on arXiv some time ago and has since provoked interesting discussion. I think the concept of sharpness is likely to be extended to other ideas as well.

- Inefficiency of Stochastic Gradient Descent with Large Mini-batches (and More Learners)
- Understanding Deep Learning Requires Re-thinking Generalization
- Categorical Reparameterization with Gumbel-Softmax

If you liked the straight through estimator, you'll likely want to upgrade to the straight through Gumbel-Softmax estimator!
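To make the trick concrete, here's a minimal NumPy sketch of drawing a Gumbel-Softmax sample (the function name, shapes, and temperature value are my own illustration, not taken from either paper):

```python
import numpy as np

def gumbel_softmax(logits, temperature=0.5, rng=np.random.default_rng(0)):
    """Draw a differentiable, approximately one-hot sample from the
    categorical distribution defined by `logits`."""
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1)
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

sample = gumbel_softmax(np.array([1.0, 2.0, 3.0]), temperature=0.1)
# as temperature -> 0 the sample approaches a one-hot vector,
# while gradients still flow through the softmax relaxation
```

The straight through variant then uses the hard `argmax` of `sample` on the forward pass while backpropagating through the soft sample.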

- Tracking the world state with recurrent entity networks

I really like this paper - I wanted to try a similar idea out myself but not only did they beat me to it, they did a great job! :)

Tying word vectors has been shown to be insanely effective for language models. This advantage likely extends to other specific tasks as well.

- Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling (biased)

Provides a theoretically driven reason for tying the input and output word embeddings (i.e. word vectors and softmax). This is related to my previous rants that making your RNN learn a one-to-one mapping between the input and output word embeddings is a very expensive and lossy operation if it's not needed (i.e. for machine translation).

- Using the Output Embedding to Improve Language Models
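The core of weight tying is simple enough to sketch in a few lines of NumPy (the names `vocab`, `embed`, and `output_logits` are my own illustration; in a real model the hidden state would come from the RNN, not from the lookup):

```python
import numpy as np

vocab, dim = 10, 4
rng = np.random.default_rng(0)
# a single embedding matrix, shared between input lookup and output softmax
E = rng.normal(size=(vocab, dim))

def embed(token_id):
    # input side: look up the word vector
    return E[token_id]

def output_logits(hidden):
    # output side: the softmax projection reuses E.T instead of
    # learning a separate vocab x dim output matrix
    return hidden @ E.T

# toy forward step: pretend the hidden state is just the embedding of token 3
h = embed(3)
logits = output_logits(h)
```

With tying, any gradient flowing into the softmax weights also updates the input embeddings, and the parameter count drops by a full `vocab x dim` matrix.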

- Improving Neural Language Models with a Continuous Cache

Independently of my work with Pointer Sentinel Mixture Models, Grave et al. implement a similar idea but with some interesting differences. They use WikiText (yay!), likely as they want more data and larger windows of previous context (2000 words!), and show the method provides substantial drops in perplexity as they scale the window larger.

- Variable Computation in Recurrent Neural Networks
- Frustratingly Short Attention Spans in Neural Language Modeling

"Training neural language models that efficiently take long-range dependencies into account seems notoriously hard and needs further investigation". I've many times ranted that most language modeling experiments only work in blocks of at most 35 timesteps - which is hilariously small. "This led to the unexpected main finding that a much simpler model which simply uses a concatenation of output representations from the previous three time steps is on par with more sophisticated memory-augmented neural language models."

- TopicRNN: A Recurrent Neural Network with Long-range Semantic Dependency

Substantial jump over previous IMDb state of the art - 93.72%.

- Pointer Sentinel Mixture Models (biased)
- Quasi-Recurrent Neural Networks (biased)

The "sparse things are better things" category:

- Training Long Short-Term Memory with Sparsified Stochastic Gradient Descent

Not directly applicable yet, but it involves Nvidia researchers and they explicitly note that "These redundant MAC operations can be eliminated by hardware techniques to improve the energy efficiency and the training speed of LSTM-based RNNs" ;)

- Exploring Sparsity in Recurrent Neural Networks

Baidu have a long history of optimizing RNNs. This work reduces the size of RNN weights by up to 90%, resulting in a speed-up of 2-7x. This is also highly useful for mobile where memory accesses (and hence model size) are a primary drain on battery life (h/t to Mat Kelcey for telling me that).
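As a rough illustration of the kind of magnitude-based pruning involved (a hedged sketch, not Baidu's actual scheme - their approach prunes gradually during training rather than in one shot):

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude entries of W, keeping only the
    top (1 - sparsity) fraction of weights."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    # value of the k-th smallest absolute weight, used as the cutoff
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) > threshold, W, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))        # stand-in for a recurrent weight matrix
W_pruned = prune_by_magnitude(W, sparsity=0.9)
# roughly 90% of the entries are now exactly zero
```

Stored in a sparse format, the pruned matrix is what delivers the memory savings - and, with the right hardware or kernels, the speed-up.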

Training recurrent neural networks is still fraught with terror. I've written previously about orthogonality in RNN weights. These works explore the recurrence within RNNs through these lenses.

- A Recurrent Neural Network without Chaos

- On Orthogonality and Learning Recurrent Networks with Long Term Dependencies
- Rotation Plane Doubly Orthogonal Recurrent Neural Networks
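For background on the orthogonality angle, a standard way to build an orthogonal recurrent weight matrix is QR decomposition of a random Gaussian matrix (a generic sketch of orthogonal initialization, not the specific construction from any of these papers):

```python
import numpy as np

def orthogonal_init(n, rng=np.random.default_rng(0)):
    """Return an n x n orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # flip column signs by sign(diag(R)) so the result is drawn
    # uniformly over orthogonal matrices
    return q * np.sign(np.diag(r))

W = orthogonal_init(8)
v = np.ones(8)
# all singular values of W are 1, so repeated application of W
# preserves the hidden state's norm instead of exploding or vanishing it
```

The catch these papers explore is that strictly enforcing orthogonality throughout training can also limit what the RNN can represent, hence the interest in softer parameterizations.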

- A Convolutional Encoder Model for Neural Machine Translation

The convolutional source encoder speeds up CPU decoding by two times with no loss of accuracy compared to a strong bi-directional LSTM baseline. This matches well with Fully Character-Level Neural Machine Translation without Explicit Segmentation by Lee et al. who speed up their NMT architecture in similar ways.

- Neural Architecture Search with Reinforcement Learning

Quite interesting work involving an RNN being used to construct either a CNN hierarchy or the core RNN cell. While some of it involves specific tuning (i.e. which variables the RNN has access to from previous time steps), it's promising and seems effective if you've the spare CPUs or GPUs. If you're interested in this, also check out An Empirical Exploration of Recurrent Network Architectures. It primarily reminded us that a forget bias of 1 is a darn good idea - something forgotten since the 90s - but also found a few variations on traditional RNNs through architecture search.

- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Google use billions of parameters to achieve state of the art in both machine translation and language modeling. By being intelligent as to when to use parts of the model, the overall amount of computation is still quite tractable however!

- Playing SNES in the Retro Learning Environment
- A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks (biased)
- Dynamic Coattention Networks For Question Answering (biased)
- LSTM-Based System-call Language Modeling and Robust Ensemble Method for Designing Host-based Intrusion Detection Systems

There will likely be a number of deep learning computer security papers in the coming year. I'm looking forward to an interesting DefCon ;)

- SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size

Like the Sparsity in RNNs paper above, model size is hugely important for mobile applications, not just due to size (download + store + RAM) but also battery life.

- Fine-grained analysis of sentence embeddings using auxiliary prediction tasks

Another paper that was previously on arXiv - if you haven't seen it, you should read it. This work shows the advantages and shortcomings of sum(BoW) and RNN sentential representations, with a few surprises.


**Interested in saying hi? ^_^**

Follow @Smerity