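<aside> 💡 This post is a reorganized summary based on the DL Basic lectures and materials by Professor Sungjoon Choi (최성준) of the Department of Artificial Intelligence at Korea University, given as part of the Naver Boostcamp AI-Tech course.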

</aside>

Introduction

Deep Learning

Neural Networks

Neuron vs. Perceptron (source: https://inteligenciafutura.mx/english-version-blog/blog-06-english-version)

Artificial neural networks (ANNs), usually simply called neural networks (NNs), are computing systems inspired by the biological neural networks that constitute animal brains.[1]

Mathematical Definition

Neural networks are function approximators that stack affine transformations followed by nonlinear transformations.

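An **affine transformation** is a transformation of the form $\mathbf{y} = \mathbf{Wx} + \mathbf{b}$. If the vectors and the matrix are treated as scalars, it has the same form as the straight line $y=wx+b$.
- Given a vector $\mathbf{x} \in \mathbb{R}^n$, multiplying it by an $m \times n$ matrix $\mathbf{W}$ is called a linear transformation, and the result is a vector in $\mathbb{R}^m$. In the scalar picture, this is a line that always passes through the origin, regardless of $\mathbf{W}$.
- Adding a vector $\mathbf{b} \in \mathbb{R}^m$ then shifts the result by a translation, and the combined map is called an affine transformation. It shares many properties with a linear transformation, but it differs in that it no longer has to map the origin to the origin.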
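
As a minimal sketch of the definition above, the following computes $\mathbf{y} = \mathbf{Wx} + \mathbf{b}$ with plain Python lists. The helper name `affine` and the numeric values are hypothetical and chosen only for illustration; in practice an array library would normally do this work.

```python
def affine(W, x, b):
    """Return y = W x + b, where W is an m x n matrix (list of rows),
    x has length n, and b has length m."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical example values: W is 2 x 3, so it maps R^3 to R^2.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
b = [0.5, -0.5]
x = [1.0, 2.0, 3.0]

y = affine(W, x, b)
print(y)  # [7.5, -1.5]

# At x = 0 the linear part vanishes and only b remains,
# so an affine map does not have to pass through the origin.
print(affine(W, [0.0, 0.0, 0.0], b))  # [0.5, -0.5]
```

Note that the output has length $m = 2$ even though the input has length $n = 3$: the shape of $\mathbf{W}$ decides the output dimension.
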
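A **nonlinear transformation** is what allows a neural network to model complex phenomena.
- Without nonlinear transformations, stacking (composing) affine transformations produces nothing that a single affine transformation could not express. For two layers, $\mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)$. Dropping the bias terms for simplicity, no matter how many matrix multiplications are stacked, the result is just one linear function:
  
  $$ \mathbf{W}_n(\cdots (\mathbf{W}_2(\mathbf{W}_1\mathbf{x}))) = \mathbf{W} \mathbf{x} \text{, where } \mathbf{W} = \mathbf{W}_n \mathbf{W}_{n-1} \cdots \mathbf{W}_1 $$
- Therefore, what makes today's deep learning possible is nonlinearity.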
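
The collapse of stacked affine layers can be checked numerically. The sketch below is only an illustration: the weights, the helper names, and the choice of ReLU as the nonlinearity are all assumptions, not something taken from the lecture. It compares two stacked affine layers with the single equivalent affine map, then shows that inserting ReLU between the layers breaks the equivalence.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights and biases for two layers.
W1, b1 = [[1.0, -2.0], [3.0, 0.5]], [0.5, -1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [1.0, 0.0]

def two_affine_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed form: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def single_affine(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

x = [1.0, 2.0]
print(two_affine_layers(x))     # [-1.0, 14.5]
print(single_affine(x))         # [-1.0, 14.5], identical to the stacked version
print(two_layers_with_relu(x))  # [4.0, 12.0], no longer matches the collapsed map

# An affine map f satisfies f(0) = (f(x) + f(-x)) / 2; the ReLU network does not.
neg_x = [-v for v in x]
midpoint_avg = [(p + q) / 2 for p, q in zip(two_layers_with_relu(x),
                                            two_layers_with_relu(neg_x))]
print(two_layers_with_relu([0.0, 0.0]), midpoint_avg)  # [2.0, -0.5] [6.0, 4.25]
```

The last check is one simple way to see that the ReLU network is not affine at all, rather than merely a different affine map.
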

Universal Approximation Theorem

"Multilayer Feedforward Networks are Universal Approximators"[2]

- Universal approximators: standard multilayer feedforward networks are capable of approximating any measurable function to any desired degree of accuracy.
- There are no theoretical constraints for the success of feedforward networks.
- Lack of success is due to inadequate learning, an insufficient number of hidden units, or the lack of a deterministic relationship between input and target.[3]

Sums of the form

$$ \sum_{j=1}^{N} \alpha_j \sigma(y_j^\top x + \theta_j) $$

where $y_j \in \mathbb{R}^n$ and $\alpha_j, \theta_j \in \mathbb{R}$, are dense in the space of continuous functions on the unit cube if $\sigma$ is any continuous sigmoidal function.
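
To make the form of these sums concrete, here is a small sketch that evaluates $\sum_j \alpha_j \sigma(y_j^\top x + \theta_j)$ with the logistic sigmoid. The function names and the parameter values are hypothetical: with $N = 2$ and hand-picked steep slopes, the sum forms a "bump" that is close to 1 on $[0.3, 0.7]$ and close to 0 elsewhere on the unit interval. This only illustrates the kind of building block involved; the theorem itself is an existence result and says nothing about how to find the parameters.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, one example of a continuous sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_sum(x, alphas, ys, thetas):
    """Evaluate G(x) = sum_j alpha_j * sigmoid(y_j . x + theta_j)."""
    return sum(a * sigmoid(sum(yi * xi for yi, xi in zip(y, x)) + th)
               for a, y, th in zip(alphas, ys, thetas))

# Hypothetical parameters (N = 2, input dimension n = 1).
# A steep rising sigmoid at 0.3 minus a steep rising sigmoid at 0.7.
k = 200.0                      # steepness, chosen by hand
alphas = [1.0, -1.0]
ys = [[k], [k]]
thetas = [-k * 0.3, -k * 0.7]

for point in (0.1, 0.5, 0.9):
    print(point, sigmoid_sum([point], alphas, ys, thetas))
```

Adding more such bumps with different heights and positions gives a piecewise-constant-like approximation of a continuous function on the interval, which is one common intuition for why sums of this form are so expressive.
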