标签:hdfseventsink 目录实现分析 flume源码
这里以按自定义头部的配置为例(根据某些业务不同写入不同的主目录)
配置:
source:
interceptors = i1 interceptors.i1.type = regex_extractor interceptors.i1.regex = /apps/logs/(.*?)/ interceptors.i1.serializers = s1 interceptors.i1.serializers.s1.name = logtypename
sink:
hdfs.path = hdfs://xxxxxx/%{logtypename}/%Y%m%d/%H hdfs.round = true hdfs.roundValue = 30 hdfs.roundUnit = minute hdfs.filePrefix = xxxxx1-
在source中定义了regex_extractor 类型的interceptor,使用org.apache.flume.interceptor.RegexExtractorInterceptor类构建interceptor对象,这个interceptor可以根据一个正则表达式提取字符串,并使用serializers把字符串作为header的值,这header可以在sink中获取对应的值做进一步的操作.
比如写hdfs的sink HDFSEventSink的process方法中
// reconstruct the path name by substituting place holders String realPath = BucketPath.escapeString(filePath, event.getHeaders(), timeZone, needRounding, roundUnit , roundValue , useLocalTime ); String realName = BucketPath.escapeString(fileName, event.getHeaders(), timeZone, needRounding, roundUnit , roundValue , useLocalTime );
几个参数项:
useLocalTime 是hdfs.useLocalTimeStamp的设置,默认是false
filePath为hdfs.path的设置,不能为空
fileName为hdfs.filePrefix的设置,默认为FlumeData
rounding(取近似值)的设置相关:
needRounding = context.getBoolean( "hdfs.round", false ); //hdfs.round的设置,默认为false if(needRounding) { String unit = context.getString( "hdfs.roundUnit", "second" ); //hdfs.roundUnit,默认为second if (unit.equalsIgnoreCase( "hour")) { this.roundUnit = Calendar.HOUR_OF_DAY; } else if (unit.equalsIgnoreCase("minute" )) { this.roundUnit = Calendar.MINUTE; } else if (unit.equalsIgnoreCase("second" )){ this.roundUnit = Calendar.SECOND; } else { LOG.warn("Rounding unit is not valid, please set one of" + "minute, hour, or second. Rounding will be disabled" ); needRounding = false ; } this.roundValue = context.getInteger("hdfs.roundValue" , 1); //hdfs.roundValue值的设置,默认为1 if(roundUnit == Calendar. SECOND || roundUnit == Calendar.MINUTE){ //下面个为检测roundValue的值是否设置合理 Preconditions.checkArgument(roundValue > 0 && roundValue <= 60, "Round value" + "must be > 0 and <= 60"); } else if (roundUnit == Calendar.HOUR_OF_DAY){ Preconditions.checkArgument(roundValue > 0 && roundValue <= 24, "Round value" + "must be > 0 and <= 24"); } }
hdfs的具体路径主要由org.apache.flume.formatter.output.BucketPath类的escapeString方法实现
BucketPath类方法分析:
1.escapeString用于替换%{yyy}的设置和%x的设置,需要设置为%x或者%{yyy}的形式,yyy可以是单词字符,和.或者-其调用replaceShorthand
final public static String TAG_REGEX = "\\%(\\w|\\%)|\\%\\{([\\w\\.-]+)\\}"; //正则表达式 final public static Pattern tagPattern = Pattern.compile(TAG_REGEX); .... public static String escapeString(String in, Map<String, String> headers, TimeZone timeZone, boolean needRounding, int unit, int roundDown, boolean useLocalTimeStamp) { long ts = clock.currentTimeMillis(); //获取当前的时间戳 Matcher matcher = tagPattern.matcher(in); //对输入的字符串进行matcher操作,返回Matcher对象,比如这里in可以是hdfs.path的设置 StringBuffer sb = new StringBuffer(); while (matcher.find()) { //用于查看字符串中是否有子字符串可以匹配正则表达式,有的话进入循环中 String replacement = ""; if (matcher.group(2) != null) { //匹配%{...}的设置 replacement = headers.get(matcher.group(2)); //获取对应的header值 if (replacement == null) { replacement = ""; } } else { //匹配%x的设置 Preconditions.checkState(matcher.group(1) != null && matcher.group(1).length() == 1, "Expected to match single character tag in string " + in); char c = matcher.group(1).charAt(0); replacement = replaceShorthand(c, headers, timeZone, needRounding, unit, roundDown, useLocalTimeStamp, ts); //对字符调用replaceShorthand方法 } replacement = replacement.replaceAll("\\\\", "\\\\\\\\"); replacement = replacement.replaceAll("\\$", "\\\\\\$"); matcher.appendReplacement(sb, replacement); } matcher.appendTail(sb); return sb.toString(); //返回字符串 }
2.replaceShorthand方法用于根据timestamp header的值和round的设置以及路径设置返回对应的日期字符串(比如%Y生成yyyyy(20150310)形式的日期),会调用roundDown方法
protected static String replaceShorthand(char c, Map<String, String> headers, TimeZone timeZone, boolean needRounding, int unit, int roundDown, boolean useLocalTimestamp, long ts) { String timestampHeader = null; try { if(!useLocalTimestamp) { //hdfs.useLocalTimeStamp设置为false时(默认) timestampHeader = headers.get("timestamp"); //获取timestamp的值 Preconditions.checkNotNull(timestampHeader, "Expected timestamp in " + "the Flume event headers, but it was null"); //检测timestamp header的值是否为空 ts = Long.valueOf(timestampHeader); } else { timestampHeader = String.valueOf(ts); } } ... if(needRounding){ //如果hdfs.round设置为true(默认为false) ts = roundDown(roundDown, unit, ts); //调用roundDown向下取整,生成新的ts } // It‘s a date String formatString = ""; switch (c) { //对字符串进行匹配,生成日期格式,比如%Y%m%d 最后生成的日期格式为yyyyMMdd case ‘%‘: return "%"; case ‘a‘: formatString = "EEE"; break; ..... case ‘z‘: formatString = "ZZZ"; break; default: // LOG.warn("Unrecognized escape in event format string: %" + c); return ""; } SimpleDateFormat format = new SimpleDateFormat(formatString); //根据格式生成SimpleDateFormat对象 if (timeZone != null) { format.setTimeZone(timeZone); } Date date = new Date(ts); //由ts生成Date对象 return format.format(date); //根据Date对象生成时间字符串 }
3.roundDown用于向下取整
private static long roundDown(int roundDown, int unit, long ts){ long timestamp = ts; if(roundDown <= 0){ roundDown = 1; } switch (unit) { case Calendar. SECOND: timestamp = TimestampRoundDownUtil. roundDownTimeStampSeconds( ts, roundDown); //如果hdfs.roundUnit是second调用TimestampRoundDownUtil.roundDownTimeStampSeconds方法 break; .... default: timestamp = ts; break; } return timestamp; }
本文出自 “菜光光的博客” 博客,请务必保留此出处http://caiguangguang.blog.51cto.com/1652935/1619539
标签:hdfseventsink 目录实现分析 flume源码
原文地址:http://caiguangguang.blog.51cto.com/1652935/1619539