Python工业项目实战 04：数仓维度层DWS层构建

2023-10-26

知识点01：课程回顾

ODS层与DWD层的功能与区别是什么？
- ODS：原始数据层
  - 存储格式：AVRO
  - 数据内容：基本与原始数据是一致的
- DWD：明细数据层
  - 存储格式：Orc
  - 数据内容：基于与ODS层是一致的
ODS层的需求是什么？
- 自动化建库建表
- 建表
```
create table one_make_ods.表名
tableproperties(schema文件)
```
  - 表名
  - 表的注释
  - 表对应的HDFS地址
  - Schema文件的地址
DWD层的需求是什么？
- 自动化建库建表
- 建表
```
create table one_make_dwd.表名(
	字段信息
)
location
```
  - 表名
  - 表的注释
  - 表对应的HDFS地址
  - 字段信息
怎么获取表的Schema信息？
- 表的注释、Schema信息
- ODS：Oracle中获取表的注释、Schema文件从Sqoop生成的
- DWD：Oracle中获取表的信息
  - TableMeta：表名，表的注释，列的信息：List
  - ColumnMeta：列名、列的注释、列的类型、长度、精度
如何使用Python构建Oracle和Hive的连接？
- Oracle：cx_Oracle
  - conn（host,port,user,passwd,sid）
- Hive/SparkSQL：pyHive
  - SparkSQL用法
  - 编程方式：python文件 | jar包
    - 流程
      - step1：SparkSession
      - step2：读取数据源
      - step3：处理
        
        注册视图
        
        spark.sql(“”)
    - spark-submit
    - 优点：灵活
    - 场景：DSL
  - 提交SQL：ThriftServer
    - 场景：SQL，调度开发
    - 流程
      - JDBC | PyHive | Beeline：代码中开发
      - spark-sql -f xxxx.sql：SQL文件的运行
如果实现SQL语句的执行？
- step1：先构建服务端的远程连接
  - 服务端地址：主机名 + 端口
  - 用户名和密码
- step2：基于这个连接构建一个游标
- step3：通过游标来执行SQL语句：execute(String：SQL)
- step4：释放资源
集中问题
- 连接构建不上
  - 映射不对：spark.bigdata.cn：Can not Connect[46.xxx.xxxx.xx，10001]
    - 修改host文件
    - 修改config.txt：node1
  - 服务端问题：spark.bigdata.cn：Can not Connect[192.168.88.100，10001]
    - Spark的TriftServer没有启动
      - 启动完成先用dg或者beeline先测试一下
    - ThriftServer：依赖于MetaStore + YARN
      - 检查YARN：本质就是一个Spark程序：实时程序，不停止的
  - Oracle：cx_Oracle + 本地客户端：D:\instantclient_12_2
- 安装
  - Python版本：Python 3.7.4
  - 命令sasl

知识点02：课程目标

回顾维度建模
- 建模流程：划分主题域和主题
- 维度设计：构建维度矩阵
- 维度模型：雪花模型、星型模式
项目中的建模流程和维度设计
- 划分了哪些主题域，每个主题域有哪些主题？
- 每个主题基于哪些维度进行分析？
维度层构建
- 时间维度
- 地区维度
- 油站维度
- 服务网点维度
- 组织机构维度
- 仓库维度
- 物流维度

知识点03：维度建模回顾：建模流程

目标：掌握维度建模的建模流程
实施
- step1-需求调研：业务调研和数据调研
  - 了解整个业务实现的过程
  - 收集所有数据使用人员对于数据的需求
  - 整理所有数据来源
- step2-划分主题域：面向业务将业务划分主题域及主题
  - 用户域、店铺域
  - 商品域、交易域、
  - 客服域、信用风控域、采购分销域
- step3-构建维度总线矩阵：明确每个业务主题对应的维度关系
- step4-明确指标统计：明确所有原生指标与衍生指标
  - 原生指标：基于某一业务事件行为下的度量，是业务定义中不可再拆分的指标，如支付总金额
  - 衍生指标：基于原子指标添加了维度：近7天的支付总金额等
- step5-定义事实与维度规范
  - 命名规范、类型规范、设计规范等
- step6-代码开发
  - 实现具体的代码开发
  - 只要知道指标的计算方式，基于维度分组计算指标
小结
- 掌握维度建模的建模流程

知识点04：维度建模回顾：维度设计

目标：掌握维度建模中维度的设计
实施
- 功能：基于组合维度来更加细化我们的指标，来更加精确的发现问题
  - 指标如果不基于组合维度进行分析得到，指标的结果是没有意义的
- 实现：开发中维度就是指标聚合时的分组字段
- 特点
  - 数据量小
  - 很少发生变化
- 采集方式：全量
- 常见维度
  - 时间维度：年、季度、月、周、天、小时
  - 地区维度：国家、省份、城市
  - 平台维度：网站、APP、小程序、H5
  - 操作系统维度：Windows、Mac OS、Android、Linux、IOS
  - ……
小结
- 掌握维度建模中维度的设计

知识点05：维度建模回顾：维度模型

目标：掌握维度设计的常用模型
路径
- step1：雪花模型
- step2：星型模型
- step3：星座模型
实施
- 雪花模型
  - 设计：部分维度通过其他维度间接关联事实表
  - 优点：避免数据的冗余
  - 缺点：关联层次比较多，数据大的情况下，底层层层Join，查询数据性能降低

在这里插入图片描述

星型模型
- 设计：所有维度表直接关联事实表
- 优点：每次查询时候，直接获取对应的数据结果，不用关联其他的维度子表，可以提高性能
- 缺点：数据冗余度相比雪花模型较高

在这里插入图片描述

星座模型
- 星座模型：基于星型模型的演变，多个事实共同使用一个维度表

在这里插入图片描述

小结
- 掌握维度设计的常用模型

知识点06：一站制造业务主题划分

目标：掌握一站制造的主题域及主题的划分
实施
- 来源
  - 主题域划分：业务或者部门划分
    - 业务：客户域、广告域、运营域……
    - 部门：运维域、财务域、销售域……
  - 数据需求来划分主题
    - 运营域：访问分析报表、转化分析报表、用户属性分析报表、订单分析报表
- 服务域
  - 安装主题：安装方式、支付费用、安装类型
  - 工单主题：派工方式、工单总数、派工类型、完工总数、
  - 维修主题：支付费用、零部件费用、故障类型
  - 派单主题：派单数、派单平均值、派单响应时间
  - 费用主题：差旅费、安装费、报销人员统计
  - 回访主题：回访人员数、回访工单状态
  - 油站主题：油站总数量、油站新增数量
- 客户域
  - 客户主题：安装数量、维修数量、巡检数量、回访数量
- 仓储域
  - 保内良品核销主题：核销数量、配件金额
  - 保内不良品核销主题：核销配件数、核销配件金额
  - 送修主题：送修申请、送修物料数量、送修类型
  - 调拨主题：调拨状态、调拨数量、调拨设备类型
  - 消耗品核销：核销总数、核销设备类型
- 服务商域
  - 工单主题：派工方式、工单总数、工单类型、客户类型
  - 服务商油站主题：油站数量、油站新增数量
- 运营域
  - 运营主题：服务人员工时、维修站分析、平均工单、网点分布
- 市场域
  - 市场主题：工单统计、完工明细、订单统计
小结
- 掌握一站制造的主题域及主题的划分

知识点07：一站制造业务维度设计

目标：掌握一站制造业务维度设计
实施
- 日期时间维度
  - 年维度、季度维度、月维度、周维度、日维度
  - 日环比、周环比、月环比、日同比、周同比、月同比
  - 环比：同一个周期内的比较
  - 同比：上个个周期的比较
- 行政地区维度
  - 地区级别：国家维度、省份维度、城市维度、县区维度、乡镇维度
- 服务网点维度
  - 网点名称、网点编号、省份、城市、县区、所属机构
- 油站维度
  - 油站类型、油站名称、油站编号、客户编号、客户名称、省份、城市、县区、油站状态、所属公司
- 组织机构维度
  - 人员编号、人员名称、岗位编号、岗位名称、部门编号、部门名称
- 服务类型维度
  - 类型编号、类型名称
- 设备维度
  - 设备类型、设备编号、设备名称、油枪数量、泵类型、软件类型
- 故障类型维度
  - 一级故障编号、一级故障名称、二级故障编号、二级故障名称
- 物流公司维度
  - 物流公司编号、物流公司名称
- ……
小结
- 掌握一站制造业务维度设计

知识点08：一站制造业务主题维度矩阵

目标：了解一站制造业务主题的维度矩阵
实施

在这里插入图片描述

小结
- 了解一站制造业务主题的维度矩阵

知识点09：行政地区维度设计

目标：掌握行政地区维度的需求及设计
路径
- step1：需求
- step2：设计
实施
- 需求：构建行政地区维度表，得到所有省份、城市、县区及乡镇维度信息
  - 省份维度表
```
省份id	省份名称
```
  - 城市维度表
```
省份id	省份名称	城市id	城市名称
```
  - 县区维度表
```
省份id	省份名称	城市id	城市名称	县区id	县区名称
```
  - 乡镇维度表
```
省份id	省份名称	城市id	城市名称	县区id	县区名称	乡镇id	乡镇名称
11       北京市      1101     北京市      110108    海淀区    110108014  清华园街道
```
  - 统计不同地区维度下的网点个数、工单个数、报销金额等
- 设计
  - 数据来源：one_make_dwd.ciss_base_areas
```
select * from one_make_dwd.ciss_base_areas;
```
    - id：该地区数据id
    - parentid：该地区的父级行政区域的id
    - rank：表示行政地区的级别
      - 0：国家：country
      - 1：省份：province
      - 2：城市：city
      - 3：县/区：county
      - 4：乡镇/区域：town
    - areaname：地区名称
    - 举例
      - 清华园街道：4

在这里插入图片描述

    - 海淀区

在这里插入图片描述

    - 北京市【市级】

在这里插入图片描述

    - 北京市【省级】

在这里插入图片描述

- **实现思路**：以乡镇维度为例

  - 获取所有乡镇的信息

    ```sql
    select id area_id,areaname area,parentid from one_make_dwd.ciss_base_areas where rank = 4;
    ```

  - 获取所有县区的信息

    ```sql
    select id county_id,areaname county,parentid from one_make_dwd.ciss_base_areas where rank = 3;
    ```

  - 获取所有省份的信息

    ```sql
    select id city_id,areaname city,parentid from one_make_dwd.ciss_base_areas where rank = 2;
    ```

  - 获取所有省份的信息

    ```sql
    select id province_id,areaname province,parentid from one_make_dwd.ciss_base_areas where rank = 1;
    ```

  - 需求：获取每个镇的所有行政地区信息

    ```
    省份id	省份名称	城市id	城市名称	县区id	县区名称		乡镇id	乡镇名称
    ```

  - 实现：下一级地区的父id = 上一级地区的id

    ```sql
    select
        province_id,province,city_id,city,county_id,county,area_id,area
    from
         ( select id area_id,areaname area,parentid from one_make_dwd.ciss_base_areas where rank = 4 ) town
    join ( select id county_id,areaname county,parentid from one_make_dwd.ciss_base_areas where rank = 3 ) county
        on town.parentid = county.county_id
    join ( select id city_id,areaname city,parentid from one_make_dwd.ciss_base_areas where rank = 2 ) city
        on county.parentid = city.city_id
    join ( select id province_id,areaname province,parentid from one_make_dwd.ciss_base_areas where rank = 1 ) province
        on city.parentid = province.province_id;
    ```

小结
- 掌握行政地区维度的需求及设计

知识点10：行政地区维度构建

目标：实现行政地区维度表的构建

实施

建维度库

create database if not exists one_make_dws;

建维度表

区域粒度【乡镇】

create external table if not exists one_make_dws.dim_location_areas(
    id string comment 'id'
    , province_id string comment '省份ID'
    , province string comment '省份名称'
    , province_short_name string comment '省份短名称'
    , city_id string comment '城市ID'
    , city string comment '城市'
    , city_short_name string comment '城市短名称'
    , county_id string comment '县城ID'
    , county string comment '县城'
    , county_short_name string comment '县城短名称'
    , area_id string comment '区域ID'
    , area string comment '区域名称'
    , area_short_name string comment '区域短名称'
) comment '区域维度区域级别表'
stored as orc
tblproperties ("orc.compress"="SNAPPY")
location '/data/dw/dws/one_make/dim_location_areas';

县区粒度

create external table if not exists one_make_dws.dim_location_county(
    id string comment 'id'
    , province_id string comment '省份ID'
    , province string comment '省份名称'
    , province_short_name string comment '省份短名称'
    , city_id string comment '城市ID'
    , city string comment '城市'
    , city_short_name string comment '城市短名称'
    , county_id string comment '县城ID'
    , county string comment '县城'
    , county_short_name string comment '县城短名称'
) comment '区域维度表（县城粒度）'
stored as orc
tblproperties ("orc.compress"="SNAPPY")
location '/data/dw/dws/one_make/dim_location_county';

抽取数据

区域粒度

insert overwrite table one_make_dws.dim_location_areas
select
    /*+repartition(1) */
    t_area.id as id,
    t_province.id as province_id,
    t_province.areaname as province,
    t_province.shortname as province_short_name,
    t_city.id as city_id,
    t_city.areaname as city,
    t_city.shortname as city_short_name,
    t_county.id as county_id,
    t_county.areaname as county,
    t_county.shortname as county_short_name,
    t_area.id as area_id,
    t_area.areaname as area,
    t_area.shortname area_short_name
from
    one_make_dwd.ciss_base_areas t_area
    inner join one_make_dwd.ciss_base_areas t_county on t_area.rank = 4 and t_area.parentid = t_county.id
    inner join one_make_dwd.ciss_base_areas t_city on t_county.parentid = t_city.id
    inner join one_make_dwd.ciss_base_areas t_province on t_city.parentid = t_province.id
    inner join one_make_dwd.ciss_base_areas t_nation on t_province.parentid = t_nation.id
;

县区粒度

insert overwrite table one_make_dws.dim_location_county
select
    /*+repartition(1) */
    t_county.id as id,
    t_province.id as province_id,
    t_province.areaname as province,
    t_province.shortname as province_short_name,
    t_city.id as city_id,
    t_city.areaname as city,
    t_city.shortname as city_short_name,
    t_county.id as county_id,
    t_county.areaname as county,
    t_county.shortname as county_short_name
from
    one_make_dwd.ciss_base_areas t_county
    inner join one_make_dwd.ciss_base_areas t_city on t_county.rank =3 and t_county.parentid = t_city.id
    inner join one_make_dwd.ciss_base_areas t_province on t_city.parentid = t_province.id
    inner join one_make_dwd.ciss_base_areas t_nation on t_province.parentid = t_nation.id
;

SQL语法如何实现分区调整： /*+repartition(1) */

小结
- 实现行政地区维度表的构建
- 自行完善城市粒度、省份粒度

知识点11：日期时间维度设计

目标：掌握日期时间维度的需求与设计
路径
- step1：需求
- step2：设计

实施

需求：构建日期时间维度表，得到所有年、季度、月、周、日的维度信息

dateId	String	日期ID	20210101
yearNameCN	String	年份名称（中文）	2021年
year	Int	年份	2021
yearMonthId	String	年月ID	202101
yearNameCN	String	年月（中文）	2021年01月
quotaId	String	季度ID	2021Q1
quotaNameCN	String	季度名称（中文）	第一季度
quotaNameEN	String	季度名称（英文）	2021 Q1
quotaShortNameEN	String	季度名称（英文简写）	Q1
weekInYearId	String	周ID	2021W1
weekInYearNameCN	String	周（中文）	2021第1周
weekInYearNameEN	String	周（英文）	2021 W1
weekday	Int	星期	5
weekdayCN	String	星期（中文）	星期五
weekdayEN	String	星期（英文）	Friday
weekdayShortNameEN	String	星期（英文缩写）	Fri
yyyyMMdd	String	日期(yyyy-mm-dd)	2021-01-01
yyyyMMddCN	String	日期中文	2021年01月01日
isWorkday	String	是否工作日	n
isWeekend	String	是否周末	n
isHoliday	String	是否法定节假日	y
dateType	String	日期类型	法定节假日 \| 工作日 \| 周末

统计不同时间维度下的呼叫次数、工单数、维修数、安装数、报销金额、核销数等

设计
- 数据来源：提前通过程序生成构建，不是从数据中抽取的
```
2021  2021-10	4	42周		星期日		2021-10-10		yes	 no		周末
```
- 执行周期：每一年生成下一年的每一天的维度信息，增量同步到维度表的分区中

小结
- 掌握日期时间维度的需求与设计

知识点12：日期时间维度构建

目标：实现日期时间维度表的构建

实施

建维度表

-- 创建日期维度表,日期维度表按照年份分区
create external table if not exists one_make_dws.dim_date(
    date_id string comment '日期id'
    , year_name_cn string comment '年份名称（中文）'
    , year_month_id string comment '年月id'
    , year_month_cn string comment '年月（中文）'
    , quota_id string comment '季度id'
    , quota_namecn string comment '季度名称（中文）'
    , quota_nameen string comment '季度名称（英文）'
    , quota_shortnameen string comment '季度名称（英文简写）'
    , week_in_year_id string comment '周id'
    , week_in_year_name_cn string comment '周（中文）'
    , week_in_year_name_en string comment '周（英文）'
    , weekday int comment '星期'
    , weekday_cn string comment '星期（中文）'
    , weekday_en string comment '星期（英文）'
    , weekday_short_name_en string comment '星期（英文缩写）'
    , yyyymmdd string comment '日期_yyyy_mm_dd'
    , yyyymmdd_cn string comment '日期中文'
    , is_workday string comment '是否工作日'
    , is_weekend string comment '是否周末'
    , is_holiday string comment '是否法定节假日'
    , date_type string comment '日期类型'
) comment '时间维度表'
partitioned by (year integer)
stored as orc
location '/data/dw/dws/one_make/dim_date'
;

加载数据
- HDFS创建路径
```
hdfs dfs -mkdir -p /data/dw/dws/one_make/dim_date/2021
```
  - 可以使用rz方式
  yum install -y lrzsz
  进入Hadoop容器
  rz
  hdfs dfs -put part-00000-cf2fc4b3-7485-4861-81e7-da0c3f76e6de-c000.snappy.orc /data/dw/dws/one_make/dim_date/2021/

上传本地文件

node1上操作

mkdir -p /mnt/docker_share/data/DIM_DATE/2021
cd /mnt/docker_share/data/DIM_DATE/2021/
rz

Hadoop镜像

docker exec -it hadoop bash


```shell
hdfs dfs -put /mnt/docker_share/data/DIM_DATE/2021 /data/dw/dws/one_make/dim_date/

申明分区

alter table one_make_dws.dim_date add if not exists partition (year='2021') location '/data/dw/dws/one_make/dim_date/2021';

查看数据

select * from one_make_dws.dim_date;

小结
- 实现日期时间维度表的构建

知识点13：服务网点维度设计

目标：掌握服务网点维度的需求与设计
路径
- step1：需求
- step2：设计

实施

需求：构建服务网点维度表，得到服务网点id、网点名称、网点所属的地理区域、服务网点状态等

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5AHJXiRF-1671679742385)(Day1010_数仓维度层DWS层构建.assets/image-20211003085815855.png)]
- 统计不同服务网点的服务人员数、工单数、核销数等

设计

数据来源

ciss_base_servicestation：服务网点信息表

select
    id,name,code,province,city,region,status,org_id,org_name
from ciss_base_servicestation;

eos_dict_type：字典状态类别表，记录所有需要使用字典标记的表

--字典类别表：字典类型id和字典类型名称
select dicttypeid,dicttypename from eos_dict_type where dicttypename = '服务网点使用状态';

eos_dict_entry：字典状态明细表，记录所有具体的状态或者类别信息
```
--字典明细表：字典类型id、具体的编号和值
```

select dicttypeid,dictid,dictname from eos_dict_entry where dicttypeid = ‘BUSS_SERVICE_STATION_STATUS’;


  

- ciss_base_areas：行政地区信息表

  - 通过具体的id关联所有地区信息

    

- **实现设计**

```sql
select
    id,name,code,
  province provinceid,city cityid,region regionid,
    pro.provincename,tcity.cityname,tregion.regionname,
    status,
    b.dictname
    org_id,org_name
from one_make_dwd.ciss_base_servicestation a
join (select
        b.dictid,b.dictname
        from one_make_dwd.eos_dict_type a join one_make_dwd.eos_dict_entry b on a.dicttypeid = b.dicttypeid
        where a.dicttypename = '服务网点使用状态'
      ) b on a.status = b.dictid
join
(select id as provinceid,areaname as provincename from one_make_dwd.ciss_base_areas where rank = 1) pro on pro.provinceid = a.province
join
(select id as cityid,areaname as cityname from one_make_dwd.ciss_base_areas where rank = 2) tcity on tcity.cityid = a.city
join
(select id as regionid,areaname as regionname from one_make_dwd.ciss_base_areas where rank = 3 ) tregion on tregion.regionid = a.region;

小结
- 掌握服务网点维度的需求与设计

知识点14：服务网点维度构建

目标：实现服务网点维度的构建

实施

建维度表

-- 服务网点维度表
create external table if not exists one_make_dws.dim_srv_station(
    id string comment '服务网点id'
    , name string comment '服务网点名称'
    ,code string comment '网点编号'
    ,province_id string comment '省份id'
    ,province string comment '省份名称'
    ,city_id string comment '城市id'
    ,city string comment '城市'
    ,county_id string comment '县城id'
    ,county string comment '县城'
    ,status string comment '服务网点状态'
    ,status_name string comment '状态中文名'
    ,org_id string comment '所属组织机构id'
    ,org_name string comment '所属组件机构名称'
)comment '服务网点维度表'
partitioned by (dt string)
stored as orc
location '/data/dw/dws/one_make/dim_srv_station';

加载数据

insert overwrite table one_make_dws.dim_srv_station partition(dt='20210101')
select
    station.id
    , station.name
    , station.code
    , province.id as province_id
    , province.areaname as province
    , city.id as city_id
    , city.areaname as city
    , county.id as county_id
    , county.areaname as county
    , station.status as status
    , dict_e.dictname as status_name
    , station.org_id as org_id
    , station.org_name as org_name
from one_make_dwd.ciss_base_servicestation station
-- 关联省份RANK为1
left join one_make_dwd.ciss_base_areas province on station.dt = '20210101' and station.province = province.id and province.rank = 1   
-- 关联城市RANK为2
left join one_make_dwd.ciss_base_areas city on station.city = city.id and city.rank = 2 
-- 关联城市RANK为3
left join one_make_dwd.ciss_base_areas county on station.region = county.id and county.rank = 3
-- 关联字典父表（dict_t）
cross join one_make_dwd.eos_dict_type dict_t  on dict_t.dt = '20210101' and dict_t.dicttypename = '服务网点使用状态'
-- 关联字典子表（dict_e）
left join one_make_dwd.eos_dict_entry dict_e on dict_e.dt = '20210101' and dict_t.dicttypeid = dict_e.dicttypeid and station.status = dict_e.dictid;

cross join：笛卡尔积，join时候不给关联条件
- select * from A,B
- select * from A join B
- cross join
查看结果

在这里插入图片描述

小结
- 实现服务网点维度的构建

知识点15：油站维度设计

目标：掌握油站维度的需求与设计
路径
- step1：需求
- step2：设计
实施
- 需求：构建油站维度表，得到油站id、油站名称、油站所属的地理区域、所属公司、油站状态等

在这里插入图片描述

设计

数据来源

ciss_base_oilstation：油站信息表

select
    id, name, code,
    customer_id, customer_name,
    province, city, region, township,
    status, customer_classify, dt
from one_make_dwd.ciss_base_oilstation
where id != '' and name is not null and name != 'null' and customer_id is not null;

eos_dict_type：字典状态类别表，记录所有需要使用字典标记的表
```
select * from eos_dict_type where dicttypename = '油站状态';
```
eos_dict_entry：字典状态明细表，记录所有具体的状态或者类别信息
```
select * from eos_dict_entry where dicttypeid = 'BUSS_OILSTATION_STATUS';
```

ciss_base_baseinfo：客户公司信息表【公司ID、公司名称】

select ygcode, companyname from one_make_dwd.ciss_base_baseinfo group by ygcode, companyname;

数据有重复，做个去重

ciss_base_customer：客户信息表【客户id、客户省份名称、所属公司ID】
```
select code, province, company from one_make_dwd.ciss_base_customer;
```
ciss_base_areas：行政地区信息表
- 通过具体的id关联所有地区信息

实现设计
- 所有表按照对应字段关联，获取对应的属性字段

小结
- 掌握油站维度的需求与设计

知识点16：油站维度构建

目标：实现油站维度的构建

实施

建维度表

-- 创建油站维度表
create external table if not exists one_make_dws.dim_oilstation(
    id string comment '油站ID'
    , name string comment '油站名称'
    , code string comment '油站编码'
    , customer_id string comment '客户ID'
    , customer_name string comment '客户名称'
    , province_id int comment '省份id'
    , province_name string comment '省份名称'
    , city_id int comment '城市id'
    , city_name string comment '城市名称'
    , county_id int comment '县城ID'
    , county_name string comment '县城名称'
    , area_id int comment '区域id'
    , area_name string comment '区域名称'
    , customer_classify_id string comment '客户分类ID'
    , customer_classify_name string comment '客户分类名称'
    , status int comment '油站状态（1、2）'
    , status_name string comment '油站状态名（正常、停用）'
    , company_id int comment '所属公司ID'
    , company_name string comment '所属公司名称'
    , customer_province_id int comment '客户所属省份ID'
    , customer_province_name string comment '客户所属省份'
) COMMENT '油站维度表'
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE
LOCATION '/data/dw/dws/one_make/dim_oilstation';

抽取数据

insert overwrite table one_make_dws.dim_oilstation partition (dt ='20210101')
select oil.id, oil.name, oil.code, customer_id, customer_name
       , oil.province province_id, p.areaname province_name
       , oil.city city_id, c.areaname city_name
       , oil.region county_id, county.areaname county_name
       , oil.township area_id, a.areaname area_name
       , oil.customer_classify customer_classify_id, ede.dictname customer_classify_name
       , oil.status status, eosde.dictname status_name
       , cbc.company company_id, binfo.companyname company_name
       , proname.id customer_province_id, proname.areaname customer_province_name
from (
     select id, name, code, customer_id, customer_name, province, city, region, township, status, customer_classify, dt
     from one_make_dwd.ciss_base_oilstation where id != '' and name is not null and name != 'null' and customer_id is not null
	 ) oil
     left join (select id, areaname, parentid from one_make_dwd.ciss_base_areas where rank = 1) p on oil.province = p.id
     left join (select id, areaname, parentid from one_make_dwd.ciss_base_areas where rank = 2) c on oil.city = c.id
     left join (select id, areaname, parentid from one_make_dwd.ciss_base_areas where rank = 3) county on oil.region = county.id
     left join (select id, areaname, parentid from one_make_dwd.ciss_base_areas where rank = 4) a on oil.township = a.id
     left join (select dictid, dictname  from one_make_dwd.eos_dict_entry) ede on oil.customer_classify = ede.dictid
     left join (select dictid, dictname from one_make_dwd.eos_dict_entry t1  left join one_make_dwd.eos_dict_type t2 on t1.dicttypeid = t2.dicttypeid where t2.dicttypename = '油站状态') eosde on oil.status = eosde.dictid
     -- 客户所属公司id，所属公司名称，所属省份id，所属省份名称
     left join (select code, province, company from one_make_dwd.ciss_base_customer) cbc on oil.customer_id = cbc.code
     left join (select id, areaname from one_make_dwd.ciss_base_areas where rank = 1 and id != 83) proname on cbc.province = proname.areaname
     left join (select ygcode, companyname from one_make_dwd.ciss_base_baseinfo group by ygcode, companyname) binfo on cbc.company = binfo.ygcode where dt = '20210101';

查看结果

在这里插入图片描述

小结
- 实现油站维度的构建

知识点17：其他维度：组织机构

目标：实现组织机构维度的设计及构建
路径
- step1：需求
- step2：设计
- step3：实现
实施
- 需求：实现组织机构维度表的构建，得到每个工程师对应的组织机构信息
- 统计不同服务人员的工单数、核销数等
- 设计
  - org_employee：员工信息表【员工id、员工编码、员工名称、用户系统id】
```
select empid,empcode,empname,userid from org_employee;
```
  - org_empposition：员工岗位信息表【员工id、岗位id】
```
select empid,positionid from org_empposition;
```
  - org_position：岗位信息表【岗位id、岗位编码、岗位名称、部门id】
```
select positionid,posicode,posiname,orgid from org_position;
```
  - org_organization：部门信息表【部门id、部门编码、部门名称】
```
select orgid,orgcode,orgname from org_organization;
```

在这里插入图片描述

实现

建维度表

-- 创建组织机构维度表，组织机构人员是经常变动的，所以按照日期分区
create external table if not exists one_make_dws.dim_emporg(
    empid string comment '人员id'   
    , empcode string comment '人员编码(erp对应的账号id)'
    , empname string comment '人员姓名'
    , userid string comment '用户系统id（登录用户名）'
    , posid string comment '岗位id'
    , posicode string comment '岗位编码'
    , posiname string comment '岗位名称'
    , orgid string comment '部门id'
    , orgcode string comment '部门编码'
    , orgname string comment '部门名称'
) comment '组织机构维度表'
partitioned by (dt string)
stored as orc
location '/data/dw/dws/one_make/dim_emporg';

抽取数据

-- 先根据dwd层的表进行关联，然后分别把数据取出来
insert overwrite table one_make_dws.dim_emporg partition(dt='20210101')
select
    emp.empid as empid
    , emp.empcode as empcode
    , emp.empname as empname
    , emp.userid as userid
    , pos.positionid as posid
    , pos.posicode as posicode
    , pos.posiname as posiname
    , org.orgid as orgid
    , org.orgcode as orgcode
    , org.orgname as orgname
from  one_make_dwd.org_employee emp
left join one_make_dwd.org_empposition emppos
    on emp.empid = emppos.empid and emp.dt = '20210101' and emppos.dt = '20210101'
left join one_make_dwd.org_position pos
    on emppos.positionid = pos.positionid and pos.dt = '20210101'
left join one_make_dwd.org_organization org
    on pos.orgid = org.orgid and org.dt = '20210101';

小结
- 实现组织机构维度的设计及构建

知识点18：其他维度：仓库、物流

目标：实现仓库维度、物流维度的构建
路径
- step1：仓库维度
- step2：物流维度

实施

仓库维度

建表

-- 仓库维度表
create external table if not exists one_make_dws.dim_warehouse(
    code string comment '仓库编码'
    , name string comment '仓库名称'
    , company_id string comment '所属公司'
    , company string comment '公司名称'
    , srv_station_id string comment '所属服务网点ID'
    , srv_station_name string comment '所属服务网点名称'
)comment '仓库维度表'
partitioned by (dt string)
stored as orc
location '/data/dw/dws/one_make/dim_warehouse';

加载

insert overwrite table one_make_dws.dim_warehouse partition(dt='20210101')
select
    warehouse.code as code
    , warehouse.name as name
    , warehouse.company as company_id
    , cmp.compmay as compmay
    , station.id as srv_station_id
    , station.name as srv_station_name
from
    one_make_dwd.ciss_base_warehouse warehouse
-- 关联公司信息表
left join (
     select
           ygcode as company_id, max(companyname) as compmay
     from one_make_dwd.ciss_base_baseinfo where dt='20210101'
     -- 需要对company信息进行分组去重，里面有一些重复数据 
     group by ygcode) cmp
     on warehouse.dt = '20210101' and cmp.company_id = warehouse.company
-- 关联服务网点和仓库关系表
left join one_make_dwd.ciss_r_serstation_warehouse station_r_warehouse
     on station_r_warehouse.dt = '20210101' and station_r_warehouse.warehouse_code = warehouse.code
-- 关联服务网点表 
left join one_make_dwd.ciss_base_servicestation station
     on station.dt = '20210101' and station.id = station_r_warehouse.service_station_id;

物流维度

建表

-- 物流维度表(和服务属性表类似)
create external table if not exists one_make_dws.dim_logistics(
    prop_name string comment '字典名称'
    , type_id string comment '属性id'
    , type_name string comment '属性名称'
)comment '物流维度表'
partitioned by (dt string)
stored as orc
location '/data/dw/dws/one_make/dim_logistics';

加载

insert overwrite table one_make_dws.dim_logistics partition(dt = '20210101')
select
    dict_t.dicttypename as prop_name
    , dict_e.dictid as type_id
    , dict_e.dictname as type_name
from  one_make_dwd.eos_dict_type dict_t
inner join one_make_dwd.eos_dict_entry dict_e
    on dict_t.dt = '20210101'
        and dict_e.dt = '20210101'
        and dict_t.dicttypeid = dict_e.dicttypeid
        and dict_t.dicttypename in (
            '物流公司'
            , '物流类型'
        )
order by dict_t.dicttypename, dict_e.dictid;

小结
- 实现仓库维度、物流维度的构建

附录一：常见问题

1.错误：没有开启Cross Join

Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans.Use the CROSS JOIN syntax to allow cartesian products between these relations

Spark2.x默认不允许执行笛卡尔积，除非显示申明cross join或者开启属性：spark.sql.crossJoin.enabled true

2.错误：Unable to move source

Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://hadoop.bigdata.cn:9000/data/dw/dws/one_make/dim_warehouse/.hive-staging_hive_2020-12-23_04-26-01_363_5663538019799519260-16/-ext-10000/part-00000-63069107-6405-4e31-a55a-6bdeefcd7d9b-c000 to destination hdfs://hadoop.bigdata.cn:9000/data/dw/dws/one_make/dim_warehouse/dt=20210101/part-00000-63069107-6405-4e31-a55a-6bdeefcd7d9b-c000; (state=,code=0)

wd.eos_dict_type dict_t
inner join one_make_dwd.eos_dict_entry dict_e
on dict_t.dt = ‘20210101’
and dict_e.dt = ‘20210101’
and dict_t.dicttypeid = dict_e.dicttypeid
and dict_t.dicttypename in (
‘物流公司’
, ‘物流类型’
)
order by dict_t.dicttypename, dict_e.dictid;
```

小结
- 实现仓库维度、物流维度的构建

附录一：常见问题

1.错误：没有开启Cross Join

Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans.Use the CROSS JOIN syntax to allow cartesian products between these relations

Spark2.x默认不允许执行笛卡尔积，除非显示申明cross join或者开启属性：spark.sql.crossJoin.enabled true

2.错误：Unable to move source

Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://hadoop.bigdata.cn:9000/data/dw/dws/one_make/dim_warehouse/.hive-staging_hive_2020-12-23_04-26-01_363_5663538019799519260-16/-ext-10000/part-00000-63069107-6405-4e31-a55a-6bdeefcd7d9b-c000 to destination hdfs://hadoop.bigdata.cn:9000/data/dw/dws/one_make/dim_warehouse/dt=20210101/part-00000-63069107-6405-4e31-a55a-6bdeefcd7d9b-c000; (state=,code=0)

重启SparkSQL的ThriftServer，与MetaStore构建新的会话连接

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

2023Python基础到应用进阶

python

hive

大数据