/**

Machine learning explorations

*/
author: the_wetware @ kjx.cz
warning: Only theoretical musings about fundamentals follow; nothing of direct practical value.
disclaimer: I am a curious outsider with a background in very loosely related disciplines, just asking stupid questions.

What is the status quo of hardware for neural networks?

At the outset my understanding was based on supposed truths the internet provided:

I was curious why. After a few evenings of googling I found out a bit more and ended up giving a short talk about it.

[ slides - pdf ] [ slides - beamer/latex src ]

Why gradient descent?

From my limited understanding, it seems that gradient descent or its modifications (stochastic GD, momentum-based methods) are immensely popular for neural network training. From my studies I remember that gradient descent was definitely not the only iterative algorithm for finding extrema, nor the most powerful one. Why not Newton's method or the conjugate gradient method? I am curious why and will gather interesting info below.

It seems that at least the conjugate gradient method sporadically receives some credit.
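
To make the comparison concrete, here is a minimal sketch (plain NumPy, on a toy 2-D quadratic loss of my own choosing, not anything from the talk) of one gradient-descent step next to one Newton step. Newton rescales the gradient by the inverse Hessian, which is exactly the second-order information plain GD throws away; for a network with n parameters that Hessian is n x n, which is one commonly cited reason the cheap first-order step wins in practice.

    import numpy as np

    # Toy 2-D quadratic loss f(x) = 0.5 * x^T A x - b^T x (purely illustrative)
    A = np.array([[3.0, 0.5],
                  [0.5, 1.0]])       # positive definite Hessian
    b = np.array([1.0, -2.0])

    def grad(x):
        return A @ x - b             # gradient of the quadratic

    x = np.zeros(2)

    # One plain gradient-descent step: follow the negative gradient with a fixed step size.
    lr = 0.1
    x_gd = x - lr * grad(x)

    # One Newton step: rescale the gradient by the inverse Hessian.
    # For a quadratic this lands directly on the minimizer A^{-1} b.
    x_newton = x - np.linalg.solve(A, grad(x))

    print("after GD step:    ", x_gd)
    print("after Newton step:", x_newton)
    print("true minimizer:   ", np.linalg.solve(A, b))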

What do we really expect when talking about generalization and overfitting?

Theory about neural networks is usually fine up to the point where overfitting is mentioned. After that, some ugly ad-hoc empirical band-aid is applied (dropout, regularization). As usual, I am curious what a proper (more precise) treatment of generalization would look like.
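
For reference, here is what those band-aids look like in code, a minimal sketch with made-up shapes and rates: L2 regularization just adds a penalty on the weights to the loss, and dropout just multiplies activations by a random binary mask during training.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy activations and weights (made-up shapes, purely illustrative).
    h = rng.normal(size=(4, 8))          # batch of 4 activation vectors
    W = rng.normal(size=(8, 8))
    data_loss = 1.0                      # pretend this was computed elsewhere

    # L2 regularization: penalize large weights on top of the data loss.
    lam = 1e-3
    loss = data_loss + lam * np.sum(W ** 2)

    # Inverted dropout: during training, zero each activation with probability p
    # and rescale the survivors so the expected value stays the same.
    p = 0.5
    mask = (rng.random(h.shape) > p) / (1.0 - p)
    h_train = h * mask                   # used during training
    h_eval = h                           # at test time the layer is left untouched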

Where is the neural network magic hidden?

A neural network is a non-linear function R^n -> R^m, and supervised learning might then be understood as a search for the closest approximation of the function mapping your input data -> your output data (on the training set).

In other, kind-of-similar problems the approach is different:

Finite element method (for solving differential equations)

Function approximation (Taylor series, Fourier analysis)

Both of these methods use addition ("+") to aggregate primitives (simple functions), while neural networks use function composition.
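
A tiny numerical illustration of the difference (a toy example of my own, not taken from the sources above): the additive approach builds its approximation as a weighted sum of fixed basis functions, while a network pushes the input through simple functions plugged into each other.

    import numpy as np

    x = np.linspace(-np.pi, np.pi, 200)
    target = np.sign(np.sin(x))          # toy target: a square wave

    # Additive aggregation: truncated Fourier series, a weighted SUM of sine basis functions.
    fourier = sum(4 / (np.pi * k) * np.sin(k * x) for k in (1, 3, 5, 7))

    # Compositional aggregation: a tiny 1-16-1 net y = W2 tanh(W1 x + b1) + b2,
    # i.e. simple functions COMPOSED into each other (weights are random and untrained here).
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=(16, 1))
    W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=(1, 1))
    net = (W2 @ np.tanh(W1 @ x[None, :] + b1) + b2).ravel()

    print("Fourier sum mean squared error:", np.mean((fourier - target) ** 2))
    print("untrained net output shape:", net.shape)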

Would it be possible to use a composition of, e.g., linear functions to obtain any interesting result?

y = w_2 * (w_1 * x_1 + b_1) + b_2

In other words: is "the magic" of neural networks hidden in their activation functions (sigmoid, hyperbolic tangent, ...) or in their structure (function composition instead of linear combination)?
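
Part of this can be checked directly: composing affine maps only ever yields another affine map, since w_2 (w_1 x + b_1) + b_2 = (w_2 w_1) x + (w_2 b_1 + b_2), so stacking purely linear layers adds no expressive power. A quick numerical check (random toy matrices of my own choosing):

    import numpy as np

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
    W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
    x = rng.normal(size=3)

    # Two stacked linear (affine) layers...
    stacked = W2 @ (W1 @ x + b1) + b2

    # ...collapse into a single affine layer with W = W2 W1 and b = W2 b1 + b2.
    collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
    assert np.allclose(stacked, collapsed)

    # With a non-linearity in between, the collapse no longer holds in general.
    nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
    print(np.allclose(nonlinear, collapsed))     # typically False

So the non-linearity is necessary for anything beyond a single linear map; how much the composition itself contributes on top of that is the open part of the question.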

How old are machine learning ideas actually?

Neural networks, and machine learning in general, seem like the hot new tech. Is it really so?

gradient descent - 1847
Méthode générale pour la résolution des systèmes d'équations simultanées; Cauchy, Augustin; 1847
perceptron - 1958
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain; Rosenblatt, Frank; Psychological Review; 1958
logistic regression - 1958
The Regression Analysis of Binary Sequences; Cox, D. R.; Journal of the Royal Statistical Society. Series B; 1958
support vector machine - 1963
Pattern recognition using generalized portrait method; Vapnik, V., Lerner, A.; Automation and Remote Control; 1963
convolutional neural network - 1980
Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position; Fukushima, Kunihiko; Biological Cybernetics 36; 1980
network pruning (regularization) - 1989
Optimal Brain Damage; LeCun, Y., Denker, J. S., Solla, S. A.; NIPS'89 Proceedings; 1989