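In the previous post I covered transformations, grouping, and aggregation. This post continues with joins, that is, the operators that connect streams. They fall into the following kinds:
- union: merges streams of the same data type into one stream.
- connect: combines two streams of different data types into one connected stream.
- cogroup: combines two streams of different data types and buffers their data in a window.
- join: combines two streams of different data types and buffers their data in a window. When the window fires, the data from the two sides is combined as a Cartesian product.
- interval join: the processing logic is basically the same as join, with one addition: it can widen the matching range between the two streams. For example, if A is a record from stream1 and B is a record from stream2, A is matched with every B that satisfies A.timestamp - interval time <= B.timestamp <= A.timestamp + interval time.
- broadcast: a broadcast stream sends all of its data to every partition of the other stream, where the computation logic then runs. This lets us implement dimension table lookups. Another option for dimension table joins is async I/O. I will cover this in a separate post.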
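The interval join condition in the list above can be checked by hand with a few lines of plain Java. This is a minimal sketch, not Flink code: the class `IntervalMatch` and its method `matches` are hypothetical names, and timestamps are assumed to be epoch milliseconds with both bounds inclusive, as in the inequality above.

```java
/** Plain-Java illustration of the interval join matching condition (not Flink API). */
class IntervalMatch {

    /** Returns true when bTs lies inside [aTs + lowerMs, aTs + upperMs], bounds inclusive. */
    static boolean matches(long aTs, long bTs, long lowerMs, long upperMs) {
        return aTs + lowerMs <= bTs && bTs <= aTs + upperMs;
    }

    public static void main(String[] args) {
        long a = 10_000L; // timestamp of A in stream1
        // An interval of 1 second on both sides of A
        System.out.println(matches(a, 9_000L, -1_000L, 1_000L));  // true: exactly on the lower bound
        System.out.println(matches(a, 11_000L, -1_000L, 1_000L)); // true: exactly on the upper bound
        System.out.println(matches(a, 11_001L, -1_000L, 1_000L)); // false: 1 ms past the upper bound
    }
}
```

With exclusive bounds, the two `<=` comparisons would become `<`, and the first two calls would return false.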
Below I demonstrate each of the join-type operators.
Usage of union:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Two socket sources with the same element type (String)
    DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
    DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);

    // union requires both streams to have the same element type
    DataStream<String> union = src1.union(src2);
    union.print("------");

    env.execute("test-union");

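Both sources read data from a socket, and both produce String elements. union then merges the two streams, so every element in the merged stream has the same type.
Usage of connect. When a computation can only run once data from two topics is available, use connect to bring the data from both topics together. The example below simulates the effect of inner join on, that is, taking the intersection. It uses keyed state (a ValueState holding a list) to store the records that have already arrived, and when the matching record arrives on the other stream, it emits the pair downstream.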

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
    DataStreamSource<String> src2 = env.socketTextStream("127.0.0.1", 8888);

    // Stream 1: parse to Integer, then key by its string form so both streams share a key type
    KeyedStream<Integer, String> intSrc = src1
        .map(new RichMapFunction<String, Integer>() {
            @Override
            public Integer map(String record) throws Exception {
                return Integer.parseInt(record);
            }
        })
        .keyBy(new KeySelector<Integer, String>() {
            @Override
            public String getKey(Integer integer) throws Exception {
                return integer.toString();
            }
        });
    KeyedStream<String, String> keyedSrc2 = src2.keyBy(x -> x);

    // Simulate inner join logic: emit only records that appear in both streams
    intSrc.connect(keyedSrc2)
        .process(new CoProcessFunction<Integer, String, Tuple2<String, Integer>>() {
            // Per-key buffers of records still waiting for a partner
            private ValueState<List<String>> stream1Buffer;
            private ValueState<List<String>> stream2Buffer;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                TypeInformation<List<String>> listType = TypeInformation.of(new TypeHint<List<String>>() {});
                stream1Buffer = getRuntimeContext().getState(new ValueStateDescriptor<>("Stream1Buffer", listType));
                stream2Buffer = getRuntimeContext().getState(new ValueStateDescriptor<>("Stream2Buffer", listType));
            }

            @Override
            public void processElement1(Integer record, Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
                join(record.toString(), collector, stream2Buffer, stream1Buffer);
            }

            @Override
            public void processElement2(String record, Context context, Collector<Tuple2<String, Integer>> collector) throws Exception {
                join(record, collector, stream1Buffer, stream2Buffer);
            }

            // Look for a partner in the other stream's buffer; if none, park the record in our own buffer
            private void join(String record, Collector<Tuple2<String, Integer>> collector,
                              ValueState<List<String>> otherBuffer, ValueState<List<String>> ownBuffer) throws IOException {
                List<String> buffered = otherBuffer.value();
                // The buffer is not sorted, so use a linear search rather than binarySearch
                int idx = buffered == null ? -1 : buffered.indexOf(record);
                if (idx >= 0) {
                    String s = buffered.remove(idx);
                    otherBuffer.update(buffered); // write the shrunken buffer back to state
                    collector.collect(new Tuple2<>(record + " join " + s, 1));
                } else {
                    List<String> own = ownBuffer.value();
                    if (own == null) {
                        own = new ArrayList<>();
                    }
                    own.add(record);
                    ownBuffer.update(own);
                }
            }
        })
        .print("------");

    env.execute();

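The buffer-and-match idea behind this connect example can be tried without a Flink cluster. The following is a minimal plain-Java sketch under stated assumptions: `BufferedInnerJoin`, `onLeft`, `onRight`, and `output` are hypothetical names, two in-memory maps stand in for Flink's keyed state, and each record is its own key, as in the example above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of a buffer-and-match inner join (not Flink API). */
class BufferedInnerJoin {
    // One buffer per side, keyed by record; these stand in for keyed state
    private final Map<String, Deque<String>> leftBuffer = new HashMap<>();
    private final Map<String, Deque<String>> rightBuffer = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    void onLeft(String record) { match(record, rightBuffer, leftBuffer); }

    void onRight(String record) { match(record, leftBuffer, rightBuffer); }

    // If the other side already holds a partner, emit the pair and consume the partner;
    // otherwise park the record in this side's buffer until a partner shows up.
    private void match(String record, Map<String, Deque<String>> other, Map<String, Deque<String>> own) {
        Deque<String> waiting = other.get(record);
        if (waiting != null && !waiting.isEmpty()) {
            output.add(record + " join " + waiting.poll());
        } else {
            own.computeIfAbsent(record, k -> new ArrayDeque<>()).add(record);
        }
    }

    List<String> output() { return output; }

    public static void main(String[] args) {
        BufferedInnerJoin join = new BufferedInnerJoin();
        join.onLeft("1");
        join.onLeft("2");
        join.onRight("2"); // partner found in the left buffer
        join.onRight("3"); // no partner, stays buffered
        join.onRight("1"); // partner found in the left buffer
        System.out.println(join.output()); // [2 join 2, 1 join 1]
    }
}
```

Records that never find a partner stay in the buffer forever in this sketch. A real job would need some way to expire them, for example timers or state TTL.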
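Usage of join. JoinFunction handles one element from each stream per call, and within each window the elements that share a key are passed to it as a Cartesian product.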

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Needed before Flink 1.12; deprecated since 1.12, where event time is the default
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // The third comma-separated field is the event timestamp in milliseconds;
    // allow 1 second of out-of-orderness
    WatermarkStrategy<String> ws = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
        .withTimestampAssigner((String data, long ts) -> Long.parseLong(data.split(",")[2]));

    // The first comma-separated field is the join key
    KeySelector<String, String> keySelector = new KeySelector<String, String>() {
        @Override
        public String getKey(String s) throws Exception {
            return s.split(",")[0];
        }
    };

    SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
    SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);

    src1.join(src2)
        .where(keySelector)
        .equalTo(keySelector)
        .window(TumblingEventTimeWindows.of(Time.seconds(2)))
        .apply(new JoinFunction<String, String, String>() {
            // Called once per matching pair inside a window
            @Override
            public String join(String s, String s2) throws Exception {
                return s.concat(":").concat(s2);
            }
        })
        .print("----");

    env.execute();

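To make the "Cartesian product" behaviour concrete, here is a minimal plain-Java sketch of what happens to the contents of one window when it fires. It is an illustration under stated assumptions, not Flink code: `WindowJoinSketch` and `joinWindow` are hypothetical names, the records are assumed to already belong to the same window, and the key is the first comma-separated field as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of a per-key Cartesian product inside one window (not Flink API). */
class WindowJoinSketch {

    /** Pairs every left record with every right record that has the same key. */
    static List<String> joinWindow(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            String leftKey = l.split(",")[0];
            for (String r : right) {
                if (leftKey.equals(r.split(",")[0])) {
                    out.add(l + ":" + r); // same output shape as the JoinFunction above
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // All timestamps fall into the same 2-second window [0, 2000)
        List<String> left = Arrays.asList("a,1,1000", "a,2,1500", "b,3,1200");
        List<String> right = Arrays.asList("a,x,1100", "a,y,1900", "c,z,1000");
        // Key "a" has 2 records on each side, so it yields 2 x 2 = 4 pairs;
        // keys "b" and "c" have no partner and produce nothing (inner join).
        joinWindow(left, right).forEach(System.out::println);
    }
}
```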
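Usage of interval join. When I sent test data, results appeared 1 second earlier than with the join above, because I set the interval to 1 second. Unlike the window join, an interval join does not wait for a window to fire. Note that this example makes both bounds exclusive.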

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Needed before Flink 1.12; deprecated since 1.12, where event time is the default
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // The third comma-separated field is the event timestamp in milliseconds
    WatermarkStrategy<String> ws = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
        .withTimestampAssigner((String data, long ts) -> Long.parseLong(data.split(",")[2]));

    // Interval join works on keyed streams; the key is the first field
    KeyedStream<String, String> src1 = env.socketTextStream("127.0.0.1", 6666)
        .assignTimestampsAndWatermarks(ws)
        .keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                return value.split(",")[0];
            }
        });
    KeyedStream<String, String> src2 = env.socketTextStream("127.0.0.1", 8888)
        .assignTimestampsAndWatermarks(ws)
        .keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                return value.split(",")[0];
            }
        });

    src1.intervalJoin(src2)
        // Match right records within 1 second before or after the left record
        .between(Time.seconds(-1), Time.seconds(1))
        // Exclude records that sit exactly on either bound
        .upperBoundExclusive()
        .lowerBoundExclusive()
        .process(new ProcessJoinFunction<String, String, String>() {
            @Override
            public void processElement(String left, String right, Context ctx, Collector<String> out) throws Exception {
                out.collect(left + "-->" + right);
            }
        })
        .print();

    env.execute("test-interval-join");

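Usage of the coGroup operator. I use a tumbling time window on event time. When the watermark reaches the window's maximum timestamp, the window fires and the computation in RichCoGroupFunction runs. Here I use a forBoundedOutOfOrderness watermark with a parameter of 1 second, so the window fires 1 second later than it would with no allowed out-of-orderness.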

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Needed before Flink 1.12; deprecated since 1.12, where event time is the default
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Here the second comma-separated field is the event timestamp in milliseconds
    WatermarkStrategy<String> ws = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(1L))
        .withTimestampAssigner((String data, long ts) -> Long.parseLong(data.split(",")[1]));

    SingleOutputStreamOperator<String> src1 = env.socketTextStream("127.0.0.1", 6666).assignTimestampsAndWatermarks(ws);
    SingleOutputStreamOperator<String> src2 = env.socketTextStream("127.0.0.1", 8888).assignTimestampsAndWatermarks(ws);

    // The first comma-separated field is the key on both sides
    KeySelector<String, String> keySelector = new KeySelector<String, String>() {
        @Override
        public String getKey(String s) throws Exception {
            return s.split(",")[0];
        }
    };

    src1.keyBy(keySelector)
        .coGroup(src2.keyBy(keySelector))
        .where(keySelector)
        .equalTo(keySelector)
        .window(TumblingEventTimeWindows.of(Time.seconds(2)))
        .apply(new RichCoGroupFunction<String, String, String>() {
            // Called once per key and window with all records from each side
            @Override
            public void coGroup(Iterable<String> first, Iterable<String> second, Collector<String> collector) throws Exception {
                StringBuilder a = new StringBuilder();
                StringBuilder b = new StringBuilder();
                for (String e : first) {
                    a.append(e);
                }
                for (String e : second) {
                    b.append(e);
                }
                collector.collect(a + ":" + b);
            }
        })
        .print("-----");

    env.execute();

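The "fires 1 second later" behaviour can be worked through with simple arithmetic. The sketch below is plain Java, not Flink code, and the names `WindowFiring`, `windowStart`, `watermark`, and `fires` are hypothetical. The formulas reflect my understanding of how Flink's bounded-out-of-orderness watermark and event-time trigger behave (watermark = largest timestamp seen - allowed lateness - 1 ms, and a window fires once the watermark reaches window end - 1 ms). Treat them as assumptions and check them against the Flink version you use. Periodic watermark emission is not modelled.

```java
/** Plain-Java illustration of when a tumbling event-time window fires (not Flink API). */
class WindowFiring {

    /** Start of the tumbling window containing ts, assuming ts >= 0 and no window offset. */
    static long windowStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    /** Watermark derived from the largest timestamp seen so far (assumed formula). */
    static long watermark(long maxSeenTs, long outOfOrdernessMs) {
        return maxSeenTs - outOfOrdernessMs - 1;
    }

    /** A window [start, end) fires once the watermark reaches its last timestamp, end - 1. */
    static boolean fires(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long size = 2_000L;                // 2-second tumbling window
        long start = windowStart(1_500L, size);
        long end = start + size;           // the record at 1500 ms lands in [0, 2000)
        System.out.println(start + " .. " + end);

        // With 1 second of allowed out-of-orderness, a record at 2999 ms is not enough...
        System.out.println(fires(watermark(2_999L, 1_000L), end)); // false
        // ...but a record at 3000 ms, 1 second past the window end, fires the window.
        System.out.println(fires(watermark(3_000L, 1_000L), end)); // true
    }
}
```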
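Demonstration of the broadcast stream. The example below is the one from the official documentation, and it is very simple. Dimension table joins have a data preloading problem: records on the main stream may arrive before the dimension data does. You can load the dimension data into a local variable of the function class, or you can tag the records that found no match in the dimension table inside the broadcast function, then route the tagged records to a side output in a later operator and handle them there.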

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> src1 = env.socketTextStream("127.0.0.1", 6666);
    // Dimension data as "key,value" records
    DataStreamSource<String> stringDataStreamSource = env.fromElements("green,good", "blue,excellent", "purple,2", "red,4");

    MapStateDescriptor<String, String> mapDesc = new MapStateDescriptor<>("rule", String.class, String.class);
    BroadcastStream<String> broadcast = stringDataStreamSource.broadcast(mapDesc);

    src1.connect(broadcast)
        .process(new BroadcastProcessFunction<String, String, String>() {
            // Same name and types as the descriptor passed to broadcast()
            private final MapStateDescriptor<String, String> mapRule = new MapStateDescriptor<>("rule", String.class, String.class);

            @Override
            public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // Read-only lookup; s is null when no matching entry has been broadcast
                String s = ctx.getBroadcastState(mapRule).get(value);
                out.collect("out:" + s);
            }

            @Override
            public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
                // Store each broadcast "key,value" record in broadcast state
                String[] kv = value.split(",");
                ctx.getBroadcastState(mapRule).put(kv[0], kv[1]);
            }
        })
        .print("------");

    env.execute();

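The tagging idea described before the example, where records without a dimension match are set aside for separate handling, can be sketched in plain Java. This is only an illustration under stated assumptions: `DimLookupSketch`, `onRule`, and `onRecord` are hypothetical names, a `HashMap` stands in for broadcast state, and two lists stand in for the main output and a side output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of routing unmatched records aside (not Flink API). */
class DimLookupSketch {
    final Map<String, String> rules = new HashMap<>();  // stands in for broadcast state
    final List<String> mainOutput = new ArrayList<>();  // stands in for the main output
    final List<String> sideOutput = new ArrayList<>();  // stands in for a side output

    /** Handles a dimension record of the form "key,value". */
    void onRule(String rule) {
        String[] kv = rule.split(",");
        rules.put(kv[0], kv[1]);
    }

    /** Handles a main-stream record; unmatched records are set aside instead of emitting null. */
    void onRecord(String key) {
        String value = rules.get(key);
        if (value == null) {
            sideOutput.add(key);
        } else {
            mainOutput.add("out:" + value);
        }
    }

    public static void main(String[] args) {
        DimLookupSketch sketch = new DimLookupSketch();
        sketch.onRecord("green");    // arrives before its rule: set aside
        sketch.onRule("green,good");
        sketch.onRecord("green");    // now matches
        sketch.onRecord("black");    // no such rule: set aside
        System.out.println(sketch.mainOutput); // [out:good]
        System.out.println(sketch.sideOutput); // [green, black]
    }
}
```

The first "green" record shows the preloading problem: it arrives before the dimension data and would otherwise be emitted with a null value.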
That's a wrap.