t-SNE: Visualizing High-Dimensional Data

15 Nov 2016

t-distribution:

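The t-distribution is typically used to estimate the mean of a normally distributed population with unknown variance from a small sample. If the population variance is known, or the sample is large enough that the sample variance estimates it accurately, the normal distribution can be used instead.

It is the basis of Student's t-test, which tests whether the difference between two sample means is significant.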
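
As a minimal sketch of the point above, the following computes a 95% confidence interval for a mean from a small sample using the t-distribution, and compares the t critical value with the normal one. The sample values and the helper name `mean_confidence_interval` are made up for illustration, and `scipy.stats` is assumed to be installed.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(sample, confidence=0.95):
    """Two-sided confidence interval for the mean, based on the t-distribution."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    mean = x.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    # because the population variance is unknown.
    sem = x.std(ddof=1) / np.sqrt(n)
    # Critical value from the t-distribution with n - 1 degrees of freedom.
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem


sample = [4.9, 5.1, 5.3, 4.7, 5.0]  # made-up illustrative values
low, high = mean_confidence_interval(sample)

# Critical values for a two-sided 95% interval.
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
z_crit = stats.norm.ppf(0.975)

print("95% CI for the mean:", (low, high))
print("t critical value:", t_crit, "normal critical value:", z_crit)
```

With only five observations the t critical value is noticeably larger than the normal one, so the interval is wider. The gap shrinks as the sample grows.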

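t-SNE, short for t-distributed stochastic neighbour embedding, is a manifold learning method. It reduces data from a high-dimensional space to a 2-D plane while preserving the neighbourhood relationships between data points, and it works very well for visualizing high-dimensional data.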
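
A minimal sketch of that reduction, using the public `TSNE` class from scikit-learn on the handwritten digits data set. The subset size of 300, `perplexity=30`, and `random_state=0` are arbitrary illustrative choices, not values taken from this post.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Load the 8x8 handwritten digits; each image is a 64-dimensional vector.
digits = load_digits()
X = digits.data[:300]    # a small subset keeps the run fast (arbitrary choice)
y = digits.target[:300]  # digit labels, useful for colouring a plot later

# Embed into 2 dimensions. The perplexity must be smaller than the number of
# samples; random_state makes the run repeatable.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X.shape, "->", X_2d.shape)
```

Each row of `X_2d` can then be drawn as a point in a scatter plot, coloured by the matching entry of `y`.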

Manifold learning:

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

For a detailed comparison of manifold learning methods, see the official sklearn user guide, which is very well written.
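
The idea that dimensionality can be "only artificially high" can be illustrated with a toy data set: points in 3-D that actually lie on a rolled-up 2-D sheet. The sketch below uses Isomap, one of the methods covered in the sklearn guide, rather than t-SNE. The sample count, noise level, and `n_neighbors=10` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D points that lie (up to a little noise) on a rolled-up 2-D sheet.
# t is the position of each point along the roll.
X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Isomap approximates distances along the sheet through a nearest-neighbour
# graph, then looks for a 2-D layout that preserves those distances.
iso = Isomap(n_neighbors=10, n_components=2)
X_unrolled = iso.fit_transform(X)

print(X.shape, "->", X_unrolled.shape)
```

A linear projection of the same data would squash layers of the roll on top of each other, which is why non-linear methods are used for this kind of structure.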

# That's an impressive list of imports.
import numpy as np
from numpy import linalg
from numpy.linalg import norm
from scipy.spatial.distance import squareform, pdist

# We import sklearn.
import sklearn
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

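# We'll hack a bit with the t-SNE code in sklearn 0.15.2.
# Note: these are private helpers; the import paths below are specific to
# that version and may have moved or been removed in later releases.
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.manifold.t_sne import (_joint_probabilities,
                                    _kl_divergence)
from sklearn.utils.extmath import _ravel

# Random state.
RS = 20150101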

# We'll use matplotlib for graphics.
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects

# We import seaborn to make nice plots.
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
                rc={"lines.linewidth": 2.5})

# We'll generate an animation with matplotlib and moviepy.
from moviepy.video.io.bindings import mplfig_to_npimage
import moviepy.editor as mpy
