建站工作室wordpress小工具滑至顶部
作业1
 1 了解行业名词 ERP CRM OA MES WMS RPA SAAS 了解每个系统的功能和应用  
 
 ERP 系统,(Enterprise Resource Planning,企业资源计划系统):ERP 系统  
 
 是一种用于管理企业各类资源的软件系统,包括生产管理、采购管理、库存管理、  
 
 财务管理等功能模块,帮助企业实现资源的优化配置和管理。  
 
 
 CRM 系统,(Customer Relationship Management,客户关系管理系统):CRM  
 
 系统是一种用于管理客户关系的软件系统,包括客户信息管理、销售管理、客户  
 
 服务管理等功能模块,帮助企业提高客户满意度和市场竞争力。  
 
 
 OA 系统,(Office Automation System,办公自动化系统):OA 系统是一种用  
 
 于协调、管理和优化办公流程的软件系统,包括电子邮件、日程安排、文档管理、  
 
 工作流程管理等功能模块,帮助企业提高工作效率和管理水平。  
 
 
 MES 系统(Manufacturing Execution System,制造执行系统):MES 系统是一  
 
 种用于管理制造过程的软件系统,包括生产计划管理、生产调度管理、工艺管理、  
 
 质量管理等功能模块,帮助企业提高生产效率和质量水平。  
 
 
 WMS 系统(Warehouse Management System,仓库管理系统):WMS 系统是一种用  
 
 于管理仓库的软件系统,包括入库管理、出库管理、库存管理、配送管理等功能  
 
 模块,帮助企业提高仓库管理效率和准确性。  
 
 
 SaaS(Software as a service),软件即服务,即通过网络提供软件服务,SaaS  
 
 平台供应商将应用软件统一部署在自己的服务器上,客户可以根据工作实际需  
 
 求,通过互联网向厂商定购所需的应用软件服务,按定购的服务多少和时间长短  
 
 向厂商支付费用,并通过互联网获得 Saas 平台供应商提供的服务。 
 
 
 
 2 模型设计 雪花模型 星型模型 ER 图 逻辑模型图...  
 
 星座模型  
 
 
 雪花模型  
 
 
 ER 图 
 
 
 逻辑模型图  
 
 
 3 检查自己的 Hadoop 集群,正常标准 444 进程。 myhadoop status 
 
 
作业2
 1 Hive 有哪些文件格式,为什么不用 textfile 格式  
   TextFile: 这是 Hadoop 的默认文件格式,它将数据以行的形式存储。每行由换  
  行符(\n)分隔。  
   SequenceFile: 这是 Hadoop 自带的一种二进制文件格式,它可以存储键值对,  
  序列化后存储。它支持压缩,而且有同步标记,容易分块,便于并行处理。  
   RCFile(Record Columnar File): 这是一种列存储格式,它将数据首先按列存  
  储,然后在每一列中再按行存储。这种格式适用于读操作多于写操作的场景,可  
  以有效提高查询性能。  
   ORC(Optimized Row Columnar): 这是 Hive 中支持的另一种列存储格式,相比  
  于 RCFile,ORC 文件格式具有更高的压缩比和查询效率。它支持 Zlib、Snappy  
  和 LZ4 压缩方式,并且提供了更高级别的向量化读取、投影和过滤优化。  
  Parquet:这是另一种列存储格式,它旨在提供跨平台的文件格式,可以很好地  
  与 Hadoop 生态系统中的其他组件(如 Impala、Presto 等)协同工作。Parquet  
  支持嵌套数据结构,并且可以与多种语言的客户端库一起使用。  
   不使用 TextFile 格式的原因主要有以下几点:  
  性能问题: TextFile 格式数据没有压缩,导致占用更多的存储空间,这会增加  
  存储成本和网络传输的负担。  
   读写效率低: TextFile 格式的数据不支持列存储,这对于列式存储优化的 Hive  
  查询引擎来说,效率较低,因为它不能有效地跳过不需要的列,从而增加了查询  
  时的 I/O 开销。  
   数据组织效率: 由于是按行存储,对于执行大量列筛选和聚合操作的查询,性能  
  不够优化。列存储格式如 ORC 和 Parquet 可以更有效地利用列索引、压缩和编码  
  技术,从而加快数据处理速度。  
   压缩支持: Hive 中的其他文件格式如 ORC 和 Parquet 提供了对数据压缩的支持,  
  可以有效减少存储空间并提高处理速度,而 TextFile 格式不支持压缩。  
   2 完成环境的部署  
  Python 安装:  
  hadoop100,hadoop101,hadoop102 都要安装  
  阿里云下载 python 安装包  
  wget  
  https://mirrors.aliyun.com/python-release/source/Python-3.9.10.tgz  
  在 hadoop100 下载完成后,通过 scp 传输到另外两个虚拟机  
  scp /root/Python-3.9.10.tgz  root@192.168.200.101:/root scp /root/Python-3.9.10.tgz  root@192.168.200.102:/root  
  我的 hadoop100 的 python 在上个阶段就装好了,直接在 hadoop101 和 hadoop102  
  再安装一次:  
  tar -zxvf Python-3.9.10.tgz  
  cd Python-3.9.10  
  安装依赖包:  
  yum install -y gcc patch libffi-devel python-devel zlib-devel  
  bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel  
  tk-devel gdbm-devel db4-devel libpcap-devel xz-devel  
  配置环境  
  ./configure --prefix=/usr/include/openssl  
  make && make install  
  建立安装快捷指令  
  ln -s /usr/include/openssl/bin/python3 /usr/bin/python3  
  ln -s /usr/include/openssl/bin/pip3 /usr/bin/pip3  
   python3 -m pip install --upgrade pip  
   pip3 config set global.index-url  
  https://pypi.tuna.tsinghua.edu.cn/simple  
   pip3 install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple  
  pymysql pyspark requests bs4 hdfs pandas numpy openpyxl  
   pip3 install urllib3==1.26.15  
  安装完成验证: 
 
  部署 Hive:  
  同步 Hive:  
  xsync /opt/module/apache-hive-3.1.2-bin  
  同步环境变量:  
  xsync /etc/profile.d/my_env.sh 
 
 
 
  部署调度平台:  
  cd /opt  
  mkdir xxljob  
  将‘Z:\班级\09_医药项目\xxl-job-student-20221220’中的所有文件上传到  
  xxljob 中 
 
  同步到其它虚拟机:  
  xsync xxljob  
  确保 mysql 账号密码均为 root  
 
  mysql 中执行 xxljob 建表文件  
  source /opt/xxljob/tables_xxl_job.sql; 
 
  编写启停脚本:  
  mkdir /root/bin  
  vim /root/bin/xxl  
  内容:  
 #!/bin/bash
act=$1
start()
{
echo "starting xxl-job"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-admin-2.3.0.jar > xxl-job.log 2>&1 &"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop101 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop102 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
}
stop(){
echo "stopping xxl-job"
ssh root@hadoop100 "ps -aux | grep xxl-job-admin | grep-v grep | awk '{print \$2}' | xargs kill -9"
ssh
root@hadoop100
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
}
status(){
echo "=============== hadoop102 ==============="
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop101 ==============="
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop100 ==============="
ssh root@hadoop100 "ps -aux | grep xxl-job | grep -v
grep"
}
case $act in
start)
start
status
;;
stop)
stop
status
;;
restart)
stop
start
;;
status)
status
;;
esac  增加执行权限:  
  chmod +x /root/bin/xxl  
  验证:  
  xxl start  
 
  netstat -nltp | grep 8080 
 
 
  安装 DataX:  
  wget  
  https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz  
  tar -zxvf datax.tar.gz  
  同步到其它节点:  
  xsync /opt/datax  
  3 完成第一个接入任务  
  客户提供的连接信息  
 
 
  需求:使用 DataX 每天凌晨 1 点定时抽取 c_org_busi 门店信息表  
  编写配置文件  
  打开 VScode:  
  mkdir -p /zhiyun/shihaihong  
  cd /zhiyun/shihaihong  
  mkdir data jobs shell sql python 读取  
  mysql  https://github.com/alibaba/DataX/blob/master/mysqlreader/doc  
  /mysqlreader.md  
  写入  
  hive  https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdf  
  swriter.md  
  job 目录中的 c_org_busi.json 文件下:  
 {
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": ["jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"print":true
}
}
}
]
}
}  运行抽取:  
  python /opt/datax/bin/datax.py /zhiyun/shihaihong/jobs/c_org_busi.json  
 
  写入到 Hive  
  json 文件改为:  
 {
"job": {"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc",
"path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
{"name":"id","type": "int"},
{"name":"busno","type": "string"},{"name":"orgname","type": "string"},
{"name":"orgsubno","type":
"string"},
{"name":"orgtype","type": "string"},
{"name":"salegroup","type":
"string"},
{"name":"org_tran_code","type":
"string"},
{"name":"accno","type": "string"},
{"name":"sendtype","type":
"string"},
{"name":"sendday","type": "string"},
{"name":"maxday","type": "string"},
{"name":"minday","type": "string"},
{"name":"notes","type": "string"},
{"name":"stamp","type": "string"},
{"name":"status","type": "string"},
{"name":"customid","type":
"string"},
{"name":"whl_vendorno","type":
"string"},
{"name":"whlgroup","type":
"string"},
{"name":"rate","type": "string"},
{"name":"creditamt","type":
"string"},
{"name":"creditday","type":
"string"},
{"name":"peoples","type": "string"},
{"name":"area","type": "string"},
{"name":"abc","type": "string"},
{"name":"address","type": "string"},
{"name":"tel","type": "string"},
{"name":"principal","type":
"string"},
{"name":"identity_card","type":
"string"},
{"name":"mobil","type": "string"},
{"name":"corporation","type":
"string"},
{"name":"saler","type": "string"},
{"name":"createtime","type":
"string"},{"name":"bank","type": "string"},
{"name":"bankno","type": "string"},
{"name":"bak1","type": "string"},
{"name":"bak2","type": "string"},
{"name":"a_bak1","type": "string"},
{"name":"aa_bak1","type": "string"},
{"name":"b_bak1","type": "string"},
{"name":"bb_bak1","type": "string"},
{"name":"y_bak1","type": "string"},
{"name":"t_bak1","type": "string"},
{"name":"ym_bak1","type": "string"},
{"name":"tm_bak1","type": "string"},
{"name":"supervise_code","type":
"string"},
{"name":"monthrent","type":
"string"},
{"name":"wms_warehid","type":
"string"},
{"name":"settlement_cycle","type":
"string"},
{"name":"apply_cycle","type":
"string"},
{"name":"applydate","type":
"string"},
{"name":"accounttype","type":
"string"},
{"name":"applydate_last","type":
"string"},
{"name":"paymode","type": "string"},
{"name":"yaolian_flag","type":
"string"},
{"name":"org_longitude","type":
"string"},
{"name":"org_latitude","type":
"string"},
{"name":"org_province","type":
"string"},
{"name":"org_city","type":
"string"},
{"name":"org_area","type":
"string"},
{"name":"business_time","type":
"string"},{"name":"yaolian_group","type":
"string"},
{"name":"pacard_storeid","type":
"string"},
{"name":"opening_time","type":
"string"},
{"name":"ret_ent_id","type":
"string"},
{"name":"ent_id","type": "string"}
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}  打开 myhadoop 集群和 hive:  
 
  创建  hdfs  文件:  
  hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi 
 
  运行抽取:  
   python  
  /opt/datax/bin/datax.py  
  /zhiyun/shihaihong/jobs/c_org_busi.json  
 
  Hive 建表  
 -- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,
stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,
org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi"; 验证数据
 
  调度脚本  
  自动运行,自动更新  
  c_org_busi.sh:  
 #!/bin/bash
# 作用: 完成从编写配置文件到验证数据的整个过程
# 需要在任何节点都可以执行
echo "开始抽取 c_org_busi 门店信息表"
echo "生成配置文件"mkdir -p /zhiyun/shihaihong/jobs
echo '
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc","path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
...
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}' > /zhiyun/shihaihong/jobs/c_org_busi.json
echo "生成 HDFS 路径"
hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi
echo "抽取数据"
python /opt/datax/bin/datax.py
/zhiyun/shihaihong/jobs/c_org_busi.json
echo "hive 建表"
hive -e '
-- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况
create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,
ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi";
'
echo "验证数据"
hive -e '
select count(1) from ods_shihaihong.c_org_busi;
select * from ods_shihaihong.c_org_busi limit 2;
'
echo "抽取完成!"  赋权:  
  chmod u=rwx c_org_busi.sh  
  ./c_org_busi.sh 
 
  设置定时:  
 
  执行一次,查看执行日志:  
 
  执行一次 
 
  4 作业标准  
  4.1 所有截图必须全屏截图  
  4.2 把调度平台的日志复制到作业中 
   
