4 - Druid Data Ingestion (Part 1)

Translated

2017-11-19 12:06:50

Tags: Druid

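I. Data Formats

[1] Data Formats: http://druid.io/docs/0.10.1/ingestion/data-formats.html

(1) Denormalized data can be ingested as JSON, CSV, or TSV.
(2) Custom formats can be parsed with the Regex parser or the JavaScript parser.
(3) Other formats are supported through extensions: http://druid.io/docs/0.10.1/development/extensions.html

[2] Configuration: the data format is configured through the parseSpec field of the dataSchema. For details see http://druid.io/docs/0.10.1/ingestion/data-formats.html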

II. Data Schema

This section covers the ingestion spec, which consists of three parts:

{
  "dataSchema" : { ... },
  "ioConfig" : { ... },
  "tuningConfig" : { ... }
}

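| Field | Type | Description | Required |
|---|---|---|---|
| dataSchema | JSON Object | Specifies the schema of the incoming data. Different specs can share the same dataSchema. | yes |
| ioConfig | JSON Object | Specifies where the data comes from and where it goes. Varies with the ingestion method. | yes |
| tuningConfig | JSON Object | Specifies how to tune the various ingestion parameters. Varies with the ingestion method. | no |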
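As a rough illustration of the table above, the following Python sketch checks a spec for the two top-level fields marked as required. The helper name `missing_top_level_fields` is hypothetical and not part of Druid; it only looks at key presence and does not validate the contents of each section.

```python
import json

# Top-level fields that the table above marks as required.
REQUIRED_TOP_LEVEL = ("dataSchema", "ioConfig")


def missing_top_level_fields(spec_text):
    """Return the required top-level fields absent from a JSON spec string."""
    spec = json.loads(spec_text)
    return [name for name in REQUIRED_TOP_LEVEL if name not in spec]


# A spec that forgot its ioConfig section.
spec_text = '{"dataSchema": {}, "tuningConfig": {}}'
print(missing_top_level_fields(spec_text))
```

A check like this is only a convenience before submitting a spec; Druid itself performs the authoritative validation.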

DataSchema

"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          { "type" : "long", "name" : "countryNum" },
          { "type" : "float", "name" : "userLatitude" },
          { "type" : "float", "name" : "userLongitude" }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [
    { "type" : "count", "name" : "count" },
    { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
    { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
    { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
  ],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }
}

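| Field | Type | Description | Required |
|---|---|---|---|
| dataSource | String | Name of the datasource to ingest into. A datasource can be thought of as a table. | yes |
| parser | JSON Object | Specifies how the ingested data is parsed. | yes |
| metricsSpec | JSON Object array | A list of aggregators. | yes |
| granularitySpec | JSON Object | Specifies how segments are created and how data is rolled up. | yes |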

Parser

"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "json",
    "timestampSpec" : {
      "column" : "timestamp",
      "format" : "auto"
    },
    "dimensionsSpec" : {
      "dimensions" : [
        "page",
        "language",
        "user",
        "unpatrolled",
        "newPage",
        "robot",
        "anonymous",
        "namespace",
        "continent",
        "country",
        "region",
        "city",
        { "type" : "long", "name" : "countryNum" },
        { "type" : "float", "name" : "userLatitude" },
        { "type" : "float", "name" : "userLongitude" }
      ],
      "dimensionExclusions" : [],
      "spatialDimensions" : []
    }
  }
}

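If type is not specified, the parser defaults to string. For other data formats, see the extensions list: http://druid.io/docs/0.10.1/development/extensions.html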

String Parser

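| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Generally string, or hadoopyString when used in a Hadoop indexing job. | no |
| parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | yes |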

ParseSpec

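A parseSpec serves two purposes:

(1) The String Parser uses the parseSpec to determine the format (JSON, CSV, or TSV) of the incoming rows.
(2) All parsers use the parseSpec to determine the timestamp and dimensions of the incoming rows.

If the format field is omitted, the parseSpec defaults to tsv.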

JSON ParseSpec

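| Field | Type | Description | Required |
|---|---|---|---|
| format | String | Should be json. | no |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| flattenSpec | JSON Object | Specifies how nested JSON is flattened. See Flattening JSON for details. | no |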

JSON Lowercase ParseSpec

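A variant of the JSON ParseSpec that lowercases all column names in the incoming JSON data.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | Should be jsonLowercase. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |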

CSV ParseSpec

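Use this with the String Parser to load CSV. Strings are parsed with the net.sf.opencsv library.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | Should be csv. | yes |
| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes |
| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes |
| listDelimiter | String | Delimiter for multi-value dimensions. | no (default == ctrl+A) |
| columns | JSON array | Specifies the columns of the data. | yes |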
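To make the roles of columns and listDelimiter more concrete, here is a small Python sketch that turns one CSV line into a row. It uses Python's standard csv module rather than opencsv, and the function name `parse_csv_row` and the `tags` column are made up for the example.

```python
import csv

# ctrl+A, which the table above lists as the default listDelimiter.
LIST_DELIMITER = "\x01"


def parse_csv_row(line, columns, list_delimiter=LIST_DELIMITER):
    """Map one CSV line onto the given column names.

    A value containing the list delimiter becomes a list, mimicking a
    multi-value dimension; any other value stays a plain string.
    """
    values = next(csv.reader([line]))
    row = {}
    for name, value in zip(columns, values):
        parts = value.split(list_delimiter)
        row[name] = parts if len(parts) > 1 else value
    return row


columns = ["timestamp", "page", "tags"]
line = "2013-08-31T01:02:33Z,Main_Page,a\x01b"
print(parse_csv_row(line, columns))
```

The columns list has to be given in the same order as the fields appear in the file, since CSV rows carry no field names of their own.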

TimestampSpec

| Field | Type | Description | Required |
|---|---|---|---|
| column | String | The column that holds the timestamp. | yes |
| format | String | iso, millis, posix, auto, or any Joda time format. | no (default == 'auto') |
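The format values can be read as different encodings of the same instant. The Python sketch below is only an approximation of that idea: `parse_timestamp` is a hypothetical helper, Joda patterns are not handled, and the rule used for auto (all digits means millis, anything else means ISO 8601) is an assumption of this sketch, not a description of Druid's detection logic.

```python
from datetime import datetime, timezone


def parse_timestamp(value, fmt="auto"):
    """Parse a timestamp value into a timezone-aware UTC datetime."""
    text = str(value)
    if fmt == "auto":
        # Assumption of this sketch: digits only -> epoch millis, else ISO 8601.
        fmt = "millis" if text.isdigit() else "iso"
    if fmt == "millis":
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)
    if fmt == "posix":
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    if fmt == "iso":
        # Replace a trailing "Z" so older Python versions accept the string.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError("format not handled by this sketch: " + fmt)


print(parse_timestamp("2013-08-31T00:00:00Z", "iso"))
print(parse_timestamp("1377907200000", "millis"))
print(parse_timestamp("1377907200", "posix"))
```

All three calls describe the same instant, the start of the interval used in the granularitySpec example.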

DimensionsSpec

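| Field | Type | Description | Required |
|---|---|---|---|
| dimensions | JSON array | A list of dimension schema objects or dimension names that identifies the dimension columns. If the list is empty, all columns other than the timestamp and metric columns are treated as string-typed dimensions. | yes |
| dimensionExclusions | JSON String array | Names of dimensions to exclude from ingestion. | no (default == []) |
| spatialDimensions | JSON Object array | An array of spatial dimensions. | no (default == []) |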

Dimension Schema

A dimension schema specifies the type and name of a dimension to be ingested. A dimension given only by name, without a type, is ingested as a string.

"dimensionsSpec" : {
  "dimensions" : [
    "page",
    "language",
    "user",
    "unpatrolled",
    "newPage",
    "robot",
    "anonymous",
    "namespace",
    "continent",
    "country",
    "region",
    "city",
    { "type" : "long", "name" : "countryNum" },
    { "type" : "float", "name" : "userLatitude" },
    { "type" : "float", "name" : "userLongitude" }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}
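The rule that a bare name means a string dimension can be sketched in Python as follows. `normalize_dimensions` is a hypothetical helper written for this note; it only mirrors the rule stated above and is not how Druid implements it.

```python
def normalize_dimensions(dimensions):
    """Expand bare dimension names into explicit dimension schema objects."""
    normalized = []
    for dim in dimensions:
        if isinstance(dim, str):
            # A bare name is shorthand for a string-typed dimension.
            normalized.append({"type": "string", "name": dim})
        else:
            # An object keeps its type; a missing type falls back to string.
            normalized.append({"type": dim.get("type", "string"), "name": dim["name"]})
    return normalized


dimensions = ["page", {"type": "long", "name": "countryNum"}]
print(normalize_dimensions(dimensions))
```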

GranularitySpec

"granularitySpec" : {
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}

The granularity spec defaults to uniform and can be changed through the type field. The uniform and arbitrary types are currently supported.

Uniform Granularity Spec

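This spec generates segments with uniform intervals.

| Field | Type | Description | Required |
|---|---|---|---|
| segmentGranularity | string | The granularity at which segments are created. | no (default == 'DAY') |
| queryGranularity | string | The minimum granularity at which results can be queried, which is also the granularity of the data stored in the segment. For example, "minute" means data is aggregated at minute granularity: when several rows collide on the tuple (minute(timestamp), dimensions), their values are combined with the aggregators instead of being stored as separate rows. | no (default == 'NONE') |
| rollup | boolean | Whether to roll up or not. | no (default == true) |
| intervals | string | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | yes for batch, no for real-time |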
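The queryGranularity description is easier to follow with numbers. The Python sketch below imitates minute-level roll-up for a single sum metric: rows that collide on (minute(timestamp), dimensions) are added together. The function `roll_up` and the sample events are invented for illustration, and real roll-up applies every aggregator in the metricsSpec, not just one sum.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, metric):
    """Sum `metric` over events sharing (minute(timestamp), dimensions)."""
    totals = defaultdict(float)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0)
        key = (minute.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key] += event[metric]
    return dict(totals)


events = [
    {"timestamp": "2013-08-31T01:02:33", "page": "A", "added": 10},
    {"timestamp": "2013-08-31T01:02:59", "page": "A", "added": 5},
    {"timestamp": "2013-08-31T01:03:01", "page": "A", "added": 1},
]
rolled = roll_up(events, ["page"], "added")
print(rolled)
```

The first two events fall into the same minute with the same page, so they collapse into one row; the third starts a new minute and stays separate.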

Arbitrary Granularity Spec

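This spec generates segments with arbitrary intervals, chosen according to segment size. It is not supported for real-time ingestion.

| Field | Type | Description | Required |
|---|---|---|---|
| queryGranularity | string | Same as in the uniform granularity spec. | no (default == 'NONE') |
| rollup | boolean | Whether to roll up or not. | no (default == true) |
| intervals | string | Same as in the uniform granularity spec. | yes for batch, no for real-time |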

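III. Schema Design

For a denormalized data source, Druid classifies each column as one of three kinds: a timestamp, a dimension, or a measure (known as a metric or aggregator in Druid).

More details:

(1) Every row must have a timestamp. Data is partitioned by time and every query has a time filter. Query results can also be bucketed by time (minutes, hours, days, and so on).
(2) Dimensions can be filtered on or grouped by. They are generally single Strings, arrays of Strings, single Longs, or single Floats.
(3) Metrics can be aggregated and sorted on.
(4) Typical production tables (datasources) have fewer than 100 dimensions and fewer than 100 metrics.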

Numeric dimensions

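Numeric dimensions (Long or Float) must be declared with their type in the dimensionsSpec; otherwise they are ingested as strings. Numeric columns are faster to group on, but because they have no indexes they are slower to filter on. See Dimension Schema for details.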

High cardinality dimensions (e.g. unique IDs)

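In practice, exact count-distinct over unique IDs is often unnecessary. Storing the IDs as a column defeats roll-up and hurts compression. Storing a sketch of the IDs instead, and using that sketch in aggregations, improves performance and reduces storage. Druid's hyperUnique aggregator is based on HyperLogLog and can be used for this purpose.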

Nested dimensions

Nested dimensions are not supported. Data such as the following

{"foo":{"bar": 3}}

should be transformed before indexing into:

{"foo_bar": 3}
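One possible way to do this transformation in an ETL step is sketched below in Python. The helper `flatten` and the underscore separator follow the foo_bar example above; both are choices made for this sketch rather than anything Druid prescribes.

```python
def flatten(record, prefix="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the joined key as prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


print(flatten({"foo": {"bar": 3}}))
```

For JSON input, the flattenSpec mentioned under JSON ParseSpec is an alternative that keeps this step inside the ingestion spec.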

Counting the number of ingested events

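A count aggregator at ingestion time counts the number of ingested events. At query time, use a longSum aggregator over that metric to get the number of ingested events. A count aggregator at query time returns the number of Druid rows instead, which can be used to determine the roll-up ratio.

In the ingestion spec:

... "metricsSpec" : [ { "type" : "count", "name" : "count" }, ...

Query the number of ingested events as follows:

... "aggregations": [ { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }, ...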
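The difference between summing the count metric and counting rows can be shown with a few rolled-up rows. In this Python sketch the rows and helper names are invented, and the roll-up ratio is taken to mean ingested events divided by stored rows, which is an assumption about the definition.

```python
# Rows as they might look after roll-up; each keeps its ingestion-time count.
rolled_rows = [
    {"page": "A", "count": 3},
    {"page": "B", "count": 1},
    {"page": "C", "count": 4},
]


def num_ingested_events(rows):
    """Equivalent of a longSum over the count metric."""
    return sum(row["count"] for row in rows)


def roll_up_ratio(rows):
    """Ingested events per stored row (definition assumed for this sketch)."""
    return num_ingested_events(rows) / len(rows)


print(num_ingested_events(rolled_rows))  # events that were ingested
print(len(rolled_rows))                  # rows actually stored
print(roll_up_ratio(rolled_rows))
```

Counting rows at query time would report 3 here, while the longSum reports the 8 events that were originally ingested.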

Schema-less dimensions

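If the dimensions field is left empty in the spec, every column that is not the timestamp, an excluded dimension, or a metric is ingested as a string-typed dimension.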

Including the same column as a dimension and a metric

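A column can serve as a dimension and, because unique counts are needed, also feed a hyperUnique metric. For this, the column has to be duplicated during ETL.

Duplicate the column in ETL:

{"device_id_dim":123, "device_id_met":123}

Then in the metricsSpec:

{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }

device_id_dim is automatically picked up as a dimension.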
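The ETL duplication step could look like the Python sketch below. `duplicate_column` is a hypothetical helper, and the _dim and _met suffixes simply follow the naming used in the example above.

```python
def duplicate_column(event, column):
    """Replace `column` with a `_dim` copy and a `_met` copy of its value."""
    out = {key: value for key, value in event.items() if key != column}
    out[column + "_dim"] = event[column]  # ingested as a dimension
    out[column + "_met"] = event[column]  # fed to the hyperUnique metric
    return out


print(duplicate_column({"device_id": 123}, "device_id"))
```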

IV. Schema Changes

The schema of a datasource can change at any time, and Druid supports segments of the same datasource having different schemas.

Replacing Segments

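Segments are identified by datasource, interval, version, and partition number. The partition number only appears when multiple segments are created for the same time granularity. For example, with hourly segments, if one hour holds more data than a single segment can store, several segments are created for that hour and distinguished by partition number.

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2

For the first segment above, dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, and partitionNum = 0. If the data is later reindexed with a new schema, the newly created segments have a higher version id.

foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2

Batch indexing in Druid (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis. In this example, queries keep using the v1 segments until all v2 segments for the interval 2015-01-01/2015-01-02 have been loaded into the cluster. After that, queries use only the v2 segments, and the v1 segments are unloaded from the cluster.

Updates that span multiple segment intervals are atomic only within each interval, not across the whole update. For example, given these segments:

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2

the cluster holds a mixture of versions until all v2 segments have been loaded:

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2

In this case, queries may hit a mixture of v1 and v2 segments.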
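The per-interval version switch can be modelled roughly in Python. This sketch parses the simplified segment names used in this post (not Druid's real segment id format), compares versions as plain strings, and assumes that a version appears in the input only once all of its segments for that interval are loaded. The helper names are hypothetical.

```python
def parse_segment_id(segment_id):
    """Split 'dataSource_interval_version_partitionNum' into its four parts."""
    data_source, interval, version, partition = segment_id.rsplit("_", 3)
    return data_source, interval, version, int(partition)


def queryable_segments(loaded):
    """Keep, for each interval, only the segments of the highest version."""
    newest = {}
    for segment_id in loaded:
        _, interval, version, _ = parse_segment_id(segment_id)
        if interval not in newest or version > newest[interval]:
            newest[interval] = version
    result = []
    for segment_id in loaded:
        _, interval, version, _ = parse_segment_id(segment_id)
        if version == newest[interval]:
            result.append(segment_id)
    return result


loaded = [
    "foo_2015-01-01/2015-01-02_v1_0",
    "foo_2015-01-02/2015-01-03_v1_1",
    "foo_2015-01-02/2015-01-03_v2_1",
    "foo_2015-01-03/2015-01-04_v1_2",
]
print(queryable_segments(loaded))
```

Only the middle interval has a v2 segment loaded, so the result mixes v1 and v2 exactly as in the example above.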

Different Schemas Among Segments

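Segments of the same datasource may have different schemas. If a string column (dimension) exists in segment A but not in segment B, queries treat that dimension in B as null. For a numeric column that is missing from a segment, aggregations skip it for that segment.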
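The null behaviour for a missing dimension can be pictured with plain Python dictionaries standing in for segment rows. Everything here (the rows, the `city` column, the helper `group_count`) is invented for illustration and says nothing about how Druid stores or scans segments.

```python
from collections import Counter


def group_count(rows, dimension):
    """Count rows per dimension value; a missing dimension counts as None."""
    return Counter(row.get(dimension) for row in rows)


# Segment A has the "city" dimension, segment B was indexed without it.
segment_a = [{"page": "A", "city": "Paris"}, {"page": "B", "city": "Oslo"}]
segment_b = [{"page": "A"}]

counts = group_count(segment_a + segment_b, "city")
print(counts)
```

The row from segment B lands in the None group, which is the Python stand-in for a null dimension value.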

hjw199089