Data Processing

Type Extensions

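Let's prepare some data first, using Boston housing prices as the example. We won't use MLJ's @load_boston this time, because much of the work below needs a DataFrame.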

using MLJ
using TableView # for showtable

using RDatasets
boston = dataset("MASS", "Boston");
y, X = unpack(boston, col -> col == :MedV, col -> col != :MedV) # MedV is the house price column (the target)

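Tip

Use unpack to split a dataset: each function is a test on column names, and the columns it selects are returned as a separate piece (here y, then X).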
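
To make the idea behind unpack concrete, here is a minimal sketch in plain Julia. It is not MLJ's implementation: the helper name split_columns and the toy table are hypothetical, and it assumes the table is a NamedTuple of column vectors and that each predicate receives a column name.

```julia
# Hypothetical helper (not part of MLJ): for each predicate on column names,
# return a table holding just the columns whose name satisfies it.
function split_columns(table::NamedTuple, predicates...)
    map(predicates) do pred
        names = Tuple(filter(pred, collect(keys(table))))
        NamedTuple{names}(table)
    end
end

# Toy stand-in for the Boston data (values are illustrative only).
toy = (Rm = [6.5, 6.4], Age = [65.2, 78.9], MedV = [24.0, 21.6])

# One predicate per requested piece, in the same order as the outputs.
y, X = split_columns(toy, ==(:MedV), !=(:MedV))
```

Unlike this sketch, which always returns tables, unpack in the example above is used to get y as a plain vector when a single column is selected.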

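Scientific Types

Introduction to scientific types

MLJ adds a family of types that describe how a dataset should be interpreted; these are called scientific types. Scientific types make it easy to search for and query models and measures.

Models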

models(matching(X,y))
models(matching(X))

julia> info("RidgeRegressor", pkg="MLJLinearModels")
...
# scientific type of the input data
 input_scitype = Table{_s23} where _s23<:(AbstractArray{_s25,1} where _s25<:Continuous),
 # scientific type of the output (target) data
 target_scitype = AbstractArray{Continuous,1},
...

Measures

measures(matching(y))

julia> info(l1)
absolute deviations; aliases: `l1`.
...
# this is also called target, but it is really the type of the data passed in
 target_scitype = Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}},
...

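Inspecting scientific types

Two scientific types are used most often in data analysis: Infinite and Finite. More detailed material is available in the scientific types documentation.

Infinite

  1. Continuous: continuous data (in practice, floating-point numbers)

  2. Count: count data (much like the above, but integers)

Finite

  1. OrderedFactor: ordered categorical data, such as ["bad", "soso", "good"], whose values can be compared

  2. Multiclass: unordered categorical data, such as ["Julia", "Rust", "Clojure"], whose values have no ordering

For tabular data (named tuples with feature fields, and DataFrames), use schema:

schema(boston)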

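| _.names | _.types | _.scitypes |
|---|---|---|
| Crim | Float64 | Continuous |
| Zn | Float64 | Continuous |
| Indus | Float64 | Continuous |
| Chas | Int64 | Count |
| NOx | Float64 | Continuous |
| Rm | Float64 | Continuous |
| Age | Float64 | Continuous |
| Dis | Float64 | Continuous |
| Rad | Int64 | Count |
| Tax | Int64 | Count |
| PTRatio | Float64 | Continuous |
| Black | Float64 | Continuous |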

For data without feature fields, use scitype:

scitype([1,2,3]) # AbstractArray{Count, 1}

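Changing scientific types

Use coerce to change scientific types, or coerce! to change them in place. But wait, why change scientific types at all? When analysing data, it helps to distinguish between

  1. how the data is encoded (for example, Int), and

  2. how the data should be interpreted (for example, as class labels, counts, and so on).

How the data is encoded is called its machine type; how it should be interpreted is called its scientific type (or scitype).

In many cases the machine type alone leaves the interpretation ambiguous. Some examples:

  1. an Int vector such as [1, 2, ...] that should be interpreted as categorical labels,

  2. an Int vector such as [1, 2, ...] that should be interpreted as count data,

  3. a String vector ["High", "Low", "High", ...] that should be interpreted as ordered categorical labels,

  4. a String vector such as ["John", "Maria", ...] that should be interpreted as unordered multiclass data,

  5. a floating-point vector [1.5, 1.5, -2.3, -2.3] that should be interpreted as categorical data (for example, the few possible values of some setting), and so on.

To resolve this ambiguity and describe the dataset more accurately, we can change its scientific types by hand. Continuing the examples above: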

X = (col_1 = [1,2,3],
     col_2 = [1,2,3],
     col_3 = ["High", "Low", "High"],
     col_4 = ["John", "Maria", "Mike"],
     col_5 = [1.5, 1.5, -2.3]) # three entries, so that every column has the same length
schema(X)

| _.names | _.types | _.scitypes |
|---|---|---|
| col_1 | Int64 | Count |
| col_2 | Int64 | Count |
| col_3 | String | Textual |
| col_4 | String | Textual |
| col_5 | Float64 | Continuous |

Xhat = coerce(X, :col_1 => OrderedFactor, # use OrderedFactor for the categorical labels in this example
              :col_2 => Count,         # optional, it is already Count
              :col_3 => OrderedFactor,
              :col_4 => Multiclass,
              :col_5 => Multiclass)
schema(Xhat)

| _.names | _.types | _.scitypes |
|---|---|---|
| col_1 | CategoricalValue{Int64,UInt32} | OrderedFactor{3} |
| col_2 | Int64 | Count |
| col_3 | CategoricalValue{String,UInt32} | OrderedFactor{2} |
| col_4 | CategoricalValue{String,UInt32} | Multiclass{3} |
| col_5 | CategoricalValue{Float64,UInt32} | Multiclass{2} |

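Is there a less laborious way to change scientific types? Yes: autotype suggests them for you, according to rules such as

  1. :few_to_finite: if a vector has only a few distinct values that repeat often, convert it to a categorical (Finite) type

  2. :discrete_to_continuous: convert discrete data (Count, Integer) to Continuous

  3. :string_to_multiclass: convert String columns to Multiclass

A few examples:

autotype(X, :few_to_finite)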

Dict{Symbol,Type} with 5 entries:
  :col_5 => OrderedFactor
  :col_2 => OrderedFactor
  :col_3 => Multiclass
  :col_4 => Multiclass
  :col_1 => OrderedFactor

autotype(X, :discrete_to_continuous)

Dict{Symbol,Type} with 2 entries:
  :col_2 => Continuous
  :col_1 => Continuous

autotype(X, :string_to_multiclass)

Dict{Symbol,Type} with 2 entries:
  :col_3 => Multiclass
  :col_4 => Multiclass

To pass several rules, wrap them in a tuple:

autotype(X, (:string_to_multiclass, :few_to_finite))

Dict{Symbol,Type} with 5 entries:
  :col_5 => OrderedFactor
  :col_2 => OrderedFactor
  :col_3 => Multiclass
  :col_4 => Multiclass
  :col_1 => OrderedFactor

Finally, just pass the returned dictionary to coerce:

coerce(X, autotype(X, :string_to_multiclass)) |> schema

| _.names | _.types | _.scitypes |
|---|---|---|
| col_1 | Int64 | Count |
| col_2 | Int64 | Count |
| col_3 | CategoricalArrays.CategoricalValue{String,UInt32} | Multiclass{2} |
| col_4 | CategoricalArrays.CategoricalValue{String,UInt32} | Multiclass{3} |
| col_5 | Float64 | Continuous |

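Addendum: for data without feature fields, pass the scientific type to coerce directly:

coerce([1,2,3], Continuous) # [1.0, 2.0, 3.0]

Categorical data

CategoricalArray is an array type designed specifically for categorical data; it backs the Finite scientific types.

OrderedFactor: ordered categorical data

  1. Conversion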

julia> x1 = coerce([1,2,3], OrderedFactor)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

  2. Construction

julia> x2 = categorical([1,2,3], ordered=true)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

  3. Inspecting the level order

julia> levels(x1)
3-element Array{Int64,1}:
 1
 2
 3

  4. Changing the level order

julia> levels!(x1, [3,2,1])
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> levels(x1)
3-element Array{Int64,1}:
 3
 2
 1
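
Why the level order matters can be sketched without any packages. The snippet below only illustrates the idea that an ordered factor compares values by their position in the level list; it is not how CategoricalArrays is implemented, and the helper level_codes is hypothetical.

```julia
# Hypothetical helper: position of each value within an explicit level order.
level_codes(values, levels) = [findfirst(==(v), levels) for v in values]

ratings = ["good", "bad", "soso"]

# With the order bad < soso < good, "good" gets the largest code.
codes = level_codes(ratings, ["bad", "soso", "good"])      # [3, 1, 2]

# Reversing the level order flips every comparison.
reversed = level_codes(ratings, ["good", "soso", "bad"])   # [1, 3, 2]

codes[1] > codes[2]        # true: "good" ranks above "bad"
reversed[1] > reversed[2]  # false under the reversed order
```

The data itself never changes here; only the declared order does, which mirrors what levels! does to x1 above.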

Multiclass: unordered categorical data

When searching for classification models, a careful look reveals a difference between them:

info("DecisionTreeClassifier").prediction_type == :probabilistic # true
info("SVMClassifier", pkg="ScikitLearn").prediction_type == :deterministic # true

:probabilistic means that prediction returns a probability for each class, for example:

import RDatasets
iris = RDatasets.dataset("datasets", "iris")
y, X = unpack(iris, ==(:Species), colname -> true)
@load DecisionTreeClassifier
tree_model = DecisionTreeClassifier()

tree = machine(tree_model, X, y)
train, test = partition(eachindex(y), 0.7, shuffle = true)
fit!(tree, rows=train)
yhat = predict(tree, rows=test)

 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
...

How do we get the class with the highest probability? Use mode:

mode.(yhat)

setosa
versicolor
virginica
...

Tip: to get the class directly at prediction time, use predict_mode.
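
What taking the mode means for a single probabilistic prediction can be sketched in plain Julia. The class names come from the iris example, but the probabilities are made-up toy numbers and the parallel-vector layout is an assumption for illustration, not the UnivariateFinite type MLJ returns.

```julia
# One probabilistic prediction, written as parallel vectors (toy numbers).
classes = ["setosa", "versicolor", "virginica"]
probs   = [0.1, 0.7, 0.2]

# The mode is the class whose probability is largest.
predicted = classes[argmax(probs)]   # "versicolor"

# The same idea applied to several predictions, in the spirit of mode.(yhat).
batch  = [[1.0, 0.0, 0.0], [0.0, 0.2, 0.8]]
labels = [classes[argmax(p)] for p in batch]   # ["setosa", "virginica"]
```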

:deterministic means that prediction returns a single class for each observation, for example:

@load SVMClassifier pkg=ScikitLearn
clf = fit!(machine(SVMClassifier(), X, y))
yhat = predict(clf, X)

 "setosa"
 "setosa"
 "setosa"
...

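There is too much output, so only three entries are shown. See the categorical data documentation for details.

A fellow contributor has already translated that documentation; take a look: https://github.com/noob-data-analaysis/data-analysis/blob/master/%5B%E6%95%B0%E6%8D%AE%E5%8F%98%E6%8D%A2%5D%40AquaIndigo/%E6%95%B0%E6%8D%AE%E5%8F%98%E6%8D%A2.md

Data Exploration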

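Overview: showtable

showtable(X) # try this in a Jupyter notebook; I could not export the result to markdown here, and someone who tried it for me also hit problems, so it is probably an upstream issue

Scientific type of each column: schema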

schema(boston)

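| _.names | _.types | _.scitypes |
|---|---|---|
| Crim | Float64 | Continuous |
| Zn | Float64 | Continuous |
| Indus | Float64 | Continuous |
| Chas | Int64 | Count |
| NOx | Float64 | Continuous |
| Rm | Float64 | Continuous |
| Age | Float64 | Continuous |
| Dis | Float64 | Continuous |
| Rad | Int64 | Count |
| Tax | Int64 | Count |
| PTRatio | Float64 | Continuous |
| Black | Float64 | Continuous |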

Custom summaries: describe

Note

describe does not work on named tuples. It needs a DataFrame, because the function is designed specifically for DataFrames.

Built-in statistics

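describe(X, :nmissing) # number of missing values in each column

13×2 DataFrame

| Row | variable | nmissing |
|---|---|---|
| | Symbol | Nothing |
| 1 | Crim | |
| 2 | Zn | |
| 3 | Indus | |
| 4 | Chas | |
| 5 | NOx | |
| 6 | Rm | |
| 7 | Age | |
| 8 | Dis | |
| 9 | Rad | |

describe(X, :min, :max, :mean, :std) # minimum, maximum, mean and standard deviation of each column; missing values are skipped

| Row | variable | min | max | mean | std |
|---|---|---|---|---|---|
| | Symbol | Real | Real | Float64 | Float64 |
| 1 | Crim | 0.00632 | 88.9762 | 3.61352 | 8.60155 |
| 2 | Zn | 0.0 | 100.0 | 11.3636 | 23.3225 |
| 3 | Indus | 0.46 | 27.74 | 11.1368 | 6.86035 |
| 4 | Chas | 0 | 1 | 0.06917 | 0.253994 |
| 5 | NOx | 0.385 | 0.871 | 0.554695 | 0.115878 |
| 6 | Rm | 3.561 | 8.78 | 6.28463 | 0.702617 |
| 7 | Age | 2.9 | 100.0 | 68.5749 | 28.1489 |
| 8 | Dis | 1.1296 | 12.1265 | 3.79504 | 2.10571 |
| 9 | Rad | 1 | 24 | 9.54941 | 8.70726 |
| 10 | Tax | 187 | 711 | 408.237 | 168.537 |
| 11 | PTRatio | 12.6 | 22.0 | 18.4555 | 2.16495 |
| 12 | Black | 0.32 | 396.9 | 356.674 | 91.2949 |
| 13 | LStat | 1.73 | 37.97 | 12.6531 | 7.14106 |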

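Custom statistics

describe(X, :symbol => fn) # fn is applied to each whole column; as far as I know, newer DataFrames releases expect the pair reversed, fn => :symbol

describe(X, :symbol => sum)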

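| Row | variable | symbol |
|---|---|---|
| | Symbol | Real |
| 1 | Crim | 1828.44 |
| 2 | Zn | 5750.0 |
| 3 | Indus | 5635.21 |
| 4 | Chas | 35 |
| 5 | NOx | 280.676 |
| 6 | Rm | 3180.02 |
| 7 | Age | 34698.9 |
| 8 | Dis | 1920.29 |
| 9 | Rad | 4832 |
| 10 | Tax | 206568 |
| 11 | PTRatio | 9338.5 |
| 12 | Black | 180477.0 |
| 13 | LStat | 6402.45 |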
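
The phrase "fn is applied to each whole column" can be sketched in plain Julia as follows. This is an illustration on a NamedTuple table, not DataFrames' describe; the helper summarize and the toy values are hypothetical.

```julia
# Hypothetical helper: apply `fn` to every column of a NamedTuple table
# and return one (variable, value) row per column.
summarize(table::NamedTuple, fn) =
    [(variable = name, value = fn(col)) for (name, col) in pairs(table)]

# Toy table (values are illustrative only).
toy = (Crim = [0.1, 0.2, 0.3], Chas = [0, 1, 0])

rows = summarize(toy, sum)      # one row per column, holding the column sum
largest = summarize(toy, maximum)
```

Any function that reduces a vector to a single value can be passed in the same way, which is the point of the custom form of describe.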

Data Cleaning

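Feature selection: FeatureSelector

Documentation

FeatureSelector(features=Symbol[])

Note: this model selects feature fields (columns) from a DataFrame or NamedTuple.

Example

model = FeatureSelector(features=[:Crim]) # select the Crim feature field
mach = fit!(machine(model, X))
MLJ.transform(mach, X) |> df -> first(df, 5) # transform clashes with the DataFrames function of the same name, so qualify it with MLJ

Tables are tedious to type, so only the first 5 rows are shown.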

| Row | Crim |
|---|---|
| | Float64 |
| 1 | 0.00632 |
| 2 | 0.02731 |
| 3 | 0.02729 |
| 4 | 0.03237 |
| 5 | 0.06905 |
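
For a NamedTuple table, the effect of selecting features can be sketched in one line of plain Julia. The helper select_features is hypothetical and only mimics the result of the transform above; it is not how FeatureSelector is implemented.

```julia
# Hypothetical helper: keep only the named columns of a NamedTuple table,
# in the order given by `features`.
select_features(table::NamedTuple, features::Vector{Symbol}) =
    NamedTuple{Tuple(features)}(table)

# Toy table with a few Boston-style columns (first two rows only).
toy = (Crim = [0.00632, 0.02731], Zn = [18.0, 0.0], Rm = [6.575, 6.421])

selected = select_features(toy, [:Crim])
```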

Filling missing values: FillImputer

Documentation

FillImputer(
   features        = [],
   continuous_fill = e -> skipmissing(e) |> median,
   count_fill      = e -> skipmissing(e) |> (f -> round(eltype(f), median(f))),
   finite_fill     = e -> skipmissing(e) |> mode)

Note: FillImputer fills missing values in the feature columns you specify. The default fill functions are shown above, and you can supply your own.

  • continuous_fill: function to use on Continuous data, by default the median

  • count_fill: function to use on Count data, by default the rounded median

  • finite_fill: function to use on Multiclass and OrderedFactor data (including binary data), by default the mode

Example

df = coerce((x1 = 1:3, x2 = [missing, 1, 2]), :x2 => Continuous)
schema(df)

| _.names | _.types | _.scitypes |
|---|---|---|
| x1 | Int64 | Count |
| x2 | Union{Missing, Float64} | Union{Missing, Continuous} |

model = FillImputer(continuous_fill = e -> skipmissing(e) |> mean)
mach = fit!(machine(model, df))
w = MLJ.transform(mach, df)
schema(w)

julia> w = MLJ.transform(mach, df)
(x1 = 1:3,
 x2 = [1.5, 1.0, 2.0],)

| _.names | _.types | _.scitypes |
|---|---|---|
| x1 | Int64 | Count |
| x2 | Union{Missing, Float64} | Continuous |
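
The filling step itself is small enough to sketch with the standard library. The helper fill_missing is hypothetical and handles a single vector; it only illustrates what a fill function such as the default median does, not FillImputer's internals.

```julia
using Statistics

# Hypothetical helper: replace each `missing` with a statistic computed
# from the non-missing entries (the median unless another function is given).
fill_missing(v, fill = median) = coalesce.(v, fill(skipmissing(v)))

x2 = [missing, 1.0, 2.0]

filled_median = fill_missing(x2)        # median of (1.0, 2.0) is 1.5
filled_mean   = fill_missing(x2, mean)  # the mean is also 1.5 here
```

With the mean as fill function this reproduces the x2 column of the example above, [1.5, 1.0, 2.0].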

Data Transformation

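Standardization: Standardizer

Documentation

Standardizer(; features=Symbol[], ignore=false, ordered_factor=false, count=false)

$$\mathrm{newX} = \frac{X' - \operatorname{mean}(X)}{\operatorname{std}(X)}$$

Note: here

  • X' is the array to be transformed

  • X is the original data used for fitting

  • newX is the new array obtained by transforming X'

Also, by default Standardizer only acts on data of scientific type Continuous. If the dataset has columns of scientific type OrderedFactor or Count, set ordered_factor=true or count=true in Standardizer.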
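
The formula can be checked by hand with the standard library. The two helper names below are hypothetical, and the sketch assumes the sample standard deviation (Julia's default std), which is consistent with the outputs shown in the example that follows.

```julia
using Statistics

# Learn the mean and standard deviation from the fitting data X ...
fit_standardizer(X) = (mu = mean(X), sigma = std(X))

# ... then apply them to any array X' that needs transforming.
standardize(params, Xnew) = (Xnew .- params.mu) ./ params.sigma

params = fit_standardizer([10.0, 20.0, 30.0])   # mean 20.0, std 10.0

same_data = standardize(params, [10.0, 20.0, 30.0])   # [-1.0, 0.0, 1.0]
new_data  = standardize(params, [40.0])               # [2.0]
```

Keeping the fitted parameters separate from the transform is what allows X and X' to differ, for example training and test data.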

Example

X = (ordinal1 = [1, 2, 3],
     ordinal2 = categorical([:x, :y, :x], ordered=true),
     ordinal3 = [10.0, 20.0, 30.0],
     ordinal4 = [-20.0, -30.0, -40.0],
     nominal = categorical(["Your father", "he", "is"]));

schema(X)

| _.names | _.types | _.scitypes |
|---|---|---|
| ordinal1 | Int64 | Count |
| ordinal2 | CategoricalArrays.CategoricalValue{Symbol,UInt32} | OrderedFactor{2} |
| ordinal3 | Float64 | Continuous |
| ordinal4 | Float64 | Continuous |
| nominal | CategoricalArrays.CategoricalValue{String,UInt32} | Multiclass{3} |

First, try it without converting ordinal1:

model = Standardizer()
mach = fit!(machine(model, X))
transform(mach, X)

(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalArrays.CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [-1.0, 0.0, 1.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalArrays.CategoricalValue{String,UInt32}["Your father", "he", "is"],)

Next we convert the Count and OrderedFactor columns. The ordered_factor=true option deserves a separate note: whatever the element type of the ordered factor, Standardizer can transform it, but it first converts the column (nums below) to an array of integers.

# first extract ordinal2 from X
temp = X.ordinal2
nums = coerce(temp, Count)
# 3-element Array{Int64,1}:
#  1
#  2
#  1

# UnivariateStandardizer is similar to Standardizer, but it works on a single vector rather than
# a named tuple or DataFrame; it also has no parameters and does not skip Count data
model = UnivariateStandardizer()
mach = fit!(machine(model, nums))
transform(mach, nums)

 -0.5773502691896256
  1.1547005383792517
 -0.5773502691896256

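Let's verify this:

model = Standardizer(ordered_factor = true, count = true) # count = true so that ordinal1 (Count) is standardized as well
mach = fit!(machine(model, X))
transform(mach, X)

As you can see, the ordinal2 column is identical to the result above.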

(ordinal1 = [-1.0, 0.0, 1.0],
 ordinal2 = [-0.5773502691896256, 1.1547005383792517, -0.5773502691896256],
 ordinal3 = [-1.0, 0.0, 1.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalArrays.CategoricalValue{String,UInt32}["Your father", "he", "is"],)

Normalization

I could not find a model for this in the documentation, so a custom model may be needed.
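
If normalization here means min-max scaling to the range [0, 1] (an assumption, since the text does not define it), a pair of plain functions may be enough to experiment with before writing a full custom MLJ model. The helper names are hypothetical and this is not an MLJ model.

```julia
# Hypothetical helpers for min-max scaling; not an MLJ model.
# Learn the minimum and maximum from the fitting data.
fit_minmax(X) = (lo = minimum(X), hi = maximum(X))

# Map values so that the fitted minimum goes to 0 and the fitted maximum to 1.
minmax_scale(params, Xnew) = (Xnew .- params.lo) ./ (params.hi - params.lo)

params = fit_minmax([10.0, 20.0, 30.0])

scaled = minmax_scale(params, [10.0, 20.0, 30.0])   # [0.0, 0.5, 1.0]
```

A constant column makes the denominator zero, so real code would need to guard against that case.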

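Discretization

A. Continuous variables

Discretization of continuous variables normally comes in equal-width, equal-frequency, clustering-based and other flavours, but the only model I found in the documentation is UnivariateDiscretizer, which is used here for equal-width discretization. (As far as I can tell, its bin edges are taken from quantiles of the fitting data, which coincide with equal-width bins on the evenly spaced data used below.)

Documentation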

  UnivariateDiscretizer(n_classes=512)

  Returns an MLJModel for for discretizing any continuous vector v
  (scitype(v) <: AbstractVector{Continuous}), where n_classes describes
  the resolution of the discretization.

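Note

n_classes is the number of classes you want to split the data into. The return value is a categorical array of scientific type OrderedFactor.

Example

Here we discretize an array of the numbers 1 to 100 into 10 equal-width classes, then transform a few random numbers.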

data = coerce(1:100, Continuous)
t = UnivariateDiscretizer(n_classes = 10)
discretizer = fit!(machine(t, data))
v = rand(1:100, 10)
w = transform(discretizer, v)

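| Random number v | Class index |
|---|---|
| 11 | 2 |
| 54 | 6 |
| 19 | 2 |
| 92 | 10 |
| 43 | 5 |
| 53 | 6 |
| 87 | 9 |
| 23 | 3 |
| 39 | 4 |
| 91 | 10 |

Tip: convert(Vector{Int}, w) gives the class index of each element.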
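
The class indices in the table can be reproduced with a few lines of plain Julia that implement equal-width binning as described in the text. This is a sketch under that assumption, with a hypothetical helper name; it is not UnivariateDiscretizer's own algorithm.

```julia
# Hypothetical helper: equal-width binning of x into classes 1..n_classes,
# using the range [lo, hi] observed when fitting.
function equal_width_class(x, lo, hi, n_classes)
    k = ceil(Int, (x - lo) / (hi - lo) * n_classes)
    return clamp(k, 1, n_classes)   # x == lo falls in class 1
end

lo, hi = 1, 100   # range of the fitting data 1:100

classes = [equal_width_class(x, lo, hi, 10) for x in [11, 54, 92, 1, 100]]
# [2, 6, 10, 1, 10]
```

The first three inputs are taken from the table above and land in the same classes.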

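B. Categorical variables

  1. Ordered variables (OrderedFactor): the documentation has no model for this, but the author told me that coerce can be used to force the scientific type. To convert using the existing level order:

    nums = categorical([:x, :y, :z], ordered=true)
    levels(nums) # [:x, :y, :z], i.e. codes 1, 2, 3
    coerce(nums, Count) # 1, 2, 3
    coerce(nums, Continuous) # 1.0, 2.0, 3.0

    You can also change the level order:

    levels!(nums, [:z, :y, :x])
    coerce(nums, Count) # 3, 2, 1

  2. Unordered variables (Multiclass): two models can do this, OneHotEncoder and ContinuousEncoder.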

    OneHotEncoder(; features=Symbol[],
                    ignore=false,
                    ordered_factor=true,
                    drop_last=false)
    

    ContinuousEncoder(one_hot_ordered_factors=false, drop_last=false)
    

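    Note: both models do the same job. During the transformation they keep Infinite data and convert Multiclass data, but ContinuousEncoder drops columns it cannot use, such as Textual data, while OneHotEncoder keeps all feature fields.

    It is hard to describe in words exactly how they transform the data, so see the code below.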

    OneHotEncoder:

    data = (col = ["a", "b", "c"],)
    nums = coerce(data, :col => Multiclass{3})
    model = OneHotEncoder()
    mach = fit!(machine(model, nums))
    transform(mach, nums)
    

    (col__a = [1.0, 0.0, 0.0],
    col__b = [0.0, 1.0, 0.0],
    col__c = [0.0, 0.0, 1.0],)
    

    ContinuousEncoder:

    data = (col = ["a", "b", "c"],
    vals = [1, 2, 3])
    schema(data)
    

    | _.names | _.types | _.scitypes |
    |---|---|---|
    | col | String | Textual |
    | vals | Int64 | Count |

    model = ContinuousEncoder()
    mach  = fit!(machine(model, data))
    transform(mach, data)
    

    (vals = [1.0, 2.0, 3.0],)
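
For readers who want to see the one-hot transformation spelled out, here is a minimal sketch in plain Julia. The helper one_hot is hypothetical and handles a single column; it mimics the col__a style of column names seen in the OneHotEncoder output above but is not MLJ's implementation.

```julia
# Hypothetical helper: one-hot encode a vector of labels into a NamedTuple
# with one 0/1 column per distinct label, named like `col__a`.
function one_hot(name::Symbol, labels)
    levels = sort(unique(labels))
    colnames = Tuple(Symbol(name, "__", l) for l in levels)
    columns = Tuple(Float64.(labels .== l) for l in levels)
    return NamedTuple{colnames}(columns)
end

encoded = one_hot(:col, ["a", "b", "c"])
```

Each row has exactly one 1.0, in the column that matches its label.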
    

See the official documentation for details.