/**

Machine learning explorations

*/
author: the_wetware @ kjx.cz
warning: Only theoretical musings about fundamentals follow; nothing of direct practical value.
disclaimer: I am a curious outsider with a background in very loosely related disciplines, just asking stupid questions.

What is the status quo of hardware for neural networks?

At the outset my understanding was based on supposed truths the internet provided:

I was curious why. After a few evenings of googling I found out a bit more and ended up giving a short talk about it.

[ slides - pdf ] [ slides - beamer/latex src ]

Why gradient descent?

From my limited understanding, it seems that gradient descent or its modifications (stochastic GD, momentum-based methods) are immensely popular for neural network training. From my studies I remember that gradient descent was definitely not the only iterative algorithm for finding extrema, nor the most powerful one. Why not Newton's method or the conjugate gradient method? I am curious why and will gather interesting info below.

It seems that at least the conjugate gradient method sporadically receives some credit.
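
To make the comparison concrete, here is a minimal sketch (plain NumPy, on a toy 2-D quadratic loss of my own choosing, not anything from the talk) of one gradient-descent step next to one Newton step. Newton rescales the gradient by the inverse Hessian, which is exactly the second-order information plain GD throws away; for a network with n parameters that Hessian is n x n, which is one commonly cited reason the cheap first-order step wins in practice.

    import numpy as np

    # Toy 2-D quadratic loss f(x) = 0.5 * x^T A x - b^T x (purely illustrative)
    A = np.array([[3.0, 0.5],
                  [0.5, 1.0]])       # positive definite Hessian
    b = np.array([1.0, -2.0])

    def grad(x):
        return A @ x - b             # gradient of the quadratic

    x = np.zeros(2)

    # One plain gradient-descent step: follow the negative gradient with a fixed step size.
    lr = 0.1
    x_gd = x - lr * grad(x)

    # One Newton step: rescale the gradient by the inverse Hessian.
    # For a quadratic this lands directly on the minimizer A^{-1} b.
    x_newton = x - np.linalg.solve(A, grad(x))

    print("after GD step:    ", x_gd)
    print("after Newton step:", x_newton)
    print("true minimizer:   ", np.linalg.solve(A, b))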

What do we really expect when talking about generalization and overfitting?

Theory about neural networks is usually fine up to the point where overfitting is mentioned. After that, some ugly ad-hoc empirical band-aid is applied (dropout, regularization). As usual, I am curious what a proper (more precise) treatment of generalization would look like.
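
For reference, here is what those band-aids look like in code, a minimal sketch with made-up shapes and rates: L2 regularization just adds a penalty on the weights to the loss, and dropout just multiplies activations by a random binary mask during training.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy activations and weights (made-up shapes, purely illustrative).
    h = rng.normal(size=(4, 8))          # batch of 4 activation vectors
    W = rng.normal(size=(8, 8))
    data_loss = 1.0                      # pretend this was computed elsewhere

    # L2 regularization: penalize large weights on top of the data loss.
    lam = 1e-3
    loss = data_loss + lam * np.sum(W ** 2)

    # Inverted dropout: during training, zero each activation with probability p
    # and rescale the survivors so the expected value stays the same.
    p = 0.5
    mask = (rng.random(h.shape) > p) / (1.0 - p)
    h_train = h * mask                   # used during training
    h_eval = h                           # at test time the layer is left untouched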

Where is the neural network magic hidden?

A neural network is a non-linear function R^n -> R^m, and supervised learning might then be understood as a search for the closest approximation of the function mapping your input data -> your output data (on the training set).

In other, kind-of-similar problems the approach is different:

Finite element method (for solving differential equations)

Function approximation (Taylor series, Fourier analysis)

Both of these methods use addition ("+") to aggregate primitives (simple functions), while neural networks use function composition.
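
A tiny numerical illustration of the difference (a toy example of my own, not taken from the sources above): the additive approach builds its approximation as a weighted sum of fixed basis functions, while a network pushes the input through simple functions plugged into each other.

    import numpy as np

    x = np.linspace(-np.pi, np.pi, 200)
    target = np.sign(np.sin(x))          # toy target: a square wave

    # Additive aggregation: truncated Fourier series, a weighted SUM of sine basis functions.
    fourier = sum(4 / (np.pi * k) * np.sin(k * x) for k in (1, 3, 5, 7))

    # Compositional aggregation: a tiny 1-16-1 net y = W2 tanh(W1 x + b1) + b2,
    # i.e. simple functions COMPOSED into each other (weights are random and untrained here).
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=(16, 1))
    W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=(1, 1))
    net = (W2 @ np.tanh(W1 @ x[None, :] + b1) + b2).ravel()

    print("Fourier sum mean squared error:", np.mean((fourier - target) ** 2))
    print("untrained net output shape:", net.shape)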

Would it be possible to use a composition of, e.g., linear functions to obtain any interesting result?

y = w_2 * (w_1 * x_1 + b_1) + b_2

In other words: is "the magic" of neural networks hidden in their activation functions (sigmoid, hyperbolic tangent, ...) or in their structure (function composition instead of linear combination)?
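
Part of this can be checked directly: composing affine maps only ever yields another affine map, since w_2 (w_1 x + b_1) + b_2 = (w_2 w_1) x + (w_2 b_1 + b_2), so stacking purely linear layers adds no expressive power. A quick numerical check (random toy matrices of my own choosing):

    import numpy as np

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
    W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
    x = rng.normal(size=3)

    # Two stacked linear (affine) layers...
    stacked = W2 @ (W1 @ x + b1) + b2

    # ...collapse into a single affine layer with W = W2 W1 and b = W2 b1 + b2.
    collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
    assert np.allclose(stacked, collapsed)

    # With a non-linearity in between, the collapse no longer holds in general.
    nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
    print(np.allclose(nonlinear, collapsed))     # typically False

So the non-linearity is necessary for anything beyond a single linear map; how much the composition itself contributes on top of that is the open part of the question.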

How old are machine learning ideas actually?

Neural networks, and machine learning in general, seem like the hot new tech. Is it really so?

gradient descent - 1847
Méthode générale pour la résolution des systèmes d'équations simultanées; Cauchy, Augustin; 1847
perceptron - 1958
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain; Rosenblatt, Frank; Psychological Review; 1958
logistic regression - 1958
The Regression Analysis of Binary Sequences; Cox, D. R.; Journal of the Royal Statistical Society. Series B; 1958
support vector machine - 1963
Pattern recognition using generalized portrait method; Vapnik, V., Lerner, A.; Automation and Remote Control; 1963
convolutional neural network - 1980
Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position; Fukushima, Kunihiko; Biological Cybernetics 36; 1980
network pruning (regularization) - 1989
Optimal Brain Damage; LeCun, Y., Denker, J. S., Solla, S. A.; NIPS'89 Proceedings; 1989