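1. Data format:
Each line of the Tomcat access log records one request, for example:
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
2. Requirements:
Find the two most-visited web pages.
The output must show the page name and its visit count.
3. Code implementation: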
import org.apache.spark.{SparkConf, SparkContext}

object TomcatLogCount extends App {
  // Create the SparkContext (local mode)
  val conf = new SparkConf().setAppName("count").setMaster("local")
  val sc = new SparkContext(conf)

  // 1. Read the log file
  val lineRdd = sc.textFile("D:\\testdata\\streaming\\localhost_access_log.txt")

  // 2. Parse each log line to extract the page name
  /**
   * 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
   * Page name: MyDemoWeb/head.jsp
   */
  val rdd1 = lineRdd.map { line =>
    // 2.1 Take the text between the first and last double quotes
    val index1 = line.indexOf("\"")
    val index2 = line.lastIndexOf("\"")
    // substring is half-open: start index inclusive, end index exclusive
    val line1 = line.substring(index1 + 1, index2) // GET /MyDemoWeb/head.jsp HTTP/1.1

    // 2.2 Take the text between the first and last spaces
    val index3 = line1.indexOf(" ")
    val index4 = line1.lastIndexOf(" ")
    val line2 = line1.substring(index3 + 1, index4) // /MyDemoWeb/head.jsp

    // 2.3 Drop the leading slash to get the page name
    val name = line2.substring(line2.indexOf("/") + 1)
    (name, 1)
  }

  // 3. Aggregate: sum the visits per page
  val rdd2 = rdd1.reduceByKey(_ + _)

  // 4. Sort by visit count, descending
  val result = rdd2.sortBy(_._2, false)

  // 5. Print only the top two pages, as the requirements ask
  result.take(2).foreach(println)

  sc.stop()
}
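The parsing and top-two logic can also be tried without a Spark cluster. The following is a minimal sketch on plain Scala collections, not part of the original job: the object name `PageCountSketch` and the helpers `pageName` and `topPages` are hypothetical names introduced here for illustration. It assumes the quoted request always has the three-part form `METHOD /path PROTOCOL`, and it skips lines that do not match instead of throwing, which the `substring`-based version above would do on a malformed line.

```scala
object PageCountSketch {
  // Hypothetical helper: extract the page name from one access-log line.
  // Returns None when the line has no quoted "METHOD /path PROTOCOL" request.
  def pageName(line: String): Option[String] = {
    val first = line.indexOf("\"")
    val last = line.lastIndexOf("\"")
    if (first < 0 || last <= first) None
    else {
      // "GET /MyDemoWeb/head.jsp HTTP/1.1" splits into method, path, protocol
      line.substring(first + 1, last).split(" ") match {
        case Array(_, path, _) => Some(path.stripPrefix("/"))
        case _                 => None
      }
    }
  }

  // Hypothetical helper: count visits per page and keep the n most visited.
  // The relative order of pages with equal counts is not specified.
  def topPages(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines
      .flatMap(line => pageName(line)) // drop unparseable lines
      .groupBy(identity)
      .map { case (name, hits) => (name, hits.size) }
      .toSeq
      .sortBy { case (_, count) => -count } // descending by visit count
      .take(n)
}
```

Calling `PageCountSketch.pageName` on the sample line from the code comment yields `Some("MyDemoWeb/head.jsp")`. In the Spark job, the same helper could replace the body of the `map` step by using `flatMap` so that malformed lines are dropped.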
4. Printed output:
Stay hungry, keep learning.
Jackson_MVP