国产成人无码AⅤ片在线观看你,菠萝视频成人污污软件

最強(qiáng)最全面的Hive SQL開發(fā)指南，四萬字全面解析

2021-11-30 10:25

園陌

關(guān)注

本文整體分為兩部分，第一部分是簡(jiǎn)寫，如果能看懂會(huì)用，就直接從此部分查，方便快捷，如果不是很理解此SQl的用法，則查看第二部分，是詳細(xì)說明，當(dāng)然第二部分語句也會(huì)更全一些！

第一部分：

hive模糊搜索表：show tables like ＇＊name＊＇；

查看表結(jié)構(gòu)信息：desc table＿name；

查看分區(qū)信息：show partitions table＿name；

加載本地文件：load data local inpath ＇／xxx／test．txt＇ overwrite into table dm．table＿name；

從查詢語句給table插入數(shù)據(jù)：insert overwrite table table＿name partition（dt） select ＊ from table＿name；

導(dǎo)出數(shù)據(jù)到本地系統(tǒng)：insert overwrite local directory ＇／tmp／text＇ select a．＊ from table＿name a order by 1；

創(chuàng)建表時(shí)指定的一些屬性：

字段分隔符：row format delimited fields terminated by ＇＇

行分隔符：row format delimited lines terminated by ＇＇

文件格式為文本型存儲(chǔ)：stored as textfile

命令行操作：hive －e ＇select table＿cloum from table＇執(zhí)行一個(gè)查詢，在終端上顯示mapreduce的進(jìn)度，執(zhí)行完畢后，最后把查詢結(jié)果輸出到終端上，接著hive進(jìn)程退出，不會(huì)進(jìn)入交互模式

hive －S －e ＇select table＿cloum from table＇－S，終端上的輸出不會(huì)有mapreduce的進(jìn)度，執(zhí)行完畢，只會(huì)把查詢結(jié)果輸出到終端上。

hive修改表名：alter table old＿table＿name rename to new＿table＿name；

hive復(fù)制表結(jié)構(gòu)：create table new＿table＿name like table＿name；

hive添加字段：alter table table＿name add columns（columns＿values bigint comment ＇comm＿text＇）；

hive修改字段：alter table table＿name change old＿column new＿column string comment ＇comm＿text＇；

刪除分區(qū)：alter table table＿name drop partition（dt＝＇2021－11－30＇）；

添加分區(qū)：alter table table＿name add partition （dt＝＇2021－11－30＇）；

刪除空數(shù)據(jù)庫(kù)：drop database myhive2；

強(qiáng)制刪除數(shù)據(jù)庫(kù)：drop database myhive2 cascade；

刪除表：drop table score5；

清空表：truncate table score6；

向hive表中加載數(shù)據(jù)

直接向分區(qū)表中插入數(shù)據(jù)：insert into table score partition（month ＝＇202107＇） values （＇001＇，＇002＇，＇100＇）；通過load方式加載數(shù)據(jù)：load data local inpath ＇／export／servers／hivedatas／score．csv＇ overwrite into table score partition（month＝＇201806＇）；通過查詢方式加載數(shù)據(jù)：insert overwrite table score2 partition（month ＝＇202106＇） select s＿id，c＿id，s＿score from score1；查詢語句中創(chuàng)建表并加載數(shù)據(jù)：create table score2 as select ＊ from score1；在創(chuàng)建表是通過location指定加載數(shù)據(jù)的路徑：create external table score6 （s＿id string，c＿id string，s＿score int） row format delimited fields terminated by ＇，＇ location ＇／myscore＇；export導(dǎo)出與import 導(dǎo)入 hive表數(shù)據(jù)（內(nèi)部表操作）：

create table techer2 like techer；－－依據(jù)已有表結(jié)構(gòu)創(chuàng)建表

export table techer to ＇／export／techer＇；

import table techer2 from ＇／export／techer＇；

20． hive表中數(shù)據(jù)導(dǎo)出

insert導(dǎo)出

將查詢的結(jié)果導(dǎo)出到本地：insert overwrite local directory ＇／export／servers／exporthive＇ select ＊ from score；

將查詢的結(jié)果格式化導(dǎo)出到本地：insert overwrite local directory ＇／export／servers／exporthive＇ row format delimited fields terminated by ＇＇ collection items terminated by ＇＃＇ select ＊ from student；

將查詢的結(jié)果導(dǎo)出到HDFS上（沒有l(wèi)ocal）：insert overwrite directory ＇／export／servers／exporthive＇ row format delimited fields terminated by ＇＇ collection items terminated by ＇＃＇ select ＊ from score；

Hadoop命令導(dǎo)出到本地：dfs －get ／export／servers／exporthive／000000＿0 ／export／servers／exporthive／local．txt；hive shell 命令導(dǎo)出

基本語法：（hive －f／－e 執(zhí)行語句或者腳本＞ file）hive －e ＂select ＊ from myhive．score；＂＞／export／servers／exporthive／score．txt

hive －f export．sh ＞／export／servers／exporthive／score．txt

export導(dǎo)出到HDFS上：export table score to ＇／export／exporthive／score＇；Hive查詢語句

GROUP BY 分組：select s＿id ，avg（s＿score） avgscore from score group by s＿id having avgscore ＞ 85；對(duì)分組后的數(shù)據(jù)進(jìn)行篩選，使用 having

join 連接：inner join 內(nèi)連接；left join 左連接；right join 右鏈接；full join 全外鏈接。

order by 排序：ASC（ascend）：升序（默認(rèn)） DESC（descend）：降序

sort by 局部排序：每個(gè)MapReduce內(nèi)部進(jìn)行排序，對(duì)全局結(jié)果集來說不是排序。

distribute by 分區(qū)排序：類似MR中partition，進(jìn)行分區(qū)，結(jié)合sort by使用

Hive函數(shù)1．聚合函數(shù)

指定列值的數(shù)目：count（）

指定列值求和：sum（）

指定列的最大值：max（）

指定列的最小值：min（）

指定列的平均值：avg（）

非空集合總體變量函數(shù)：var＿pop（col）

非空集合樣本變量函數(shù)：var＿samp （col）

總體標(biāo)準(zhǔn)偏離函數(shù)：stddev＿pop（col）

分位數(shù)函數(shù)：percentile（BIGINT col， p）

中位數(shù)函數(shù)：percentile（BIGINT col， 0．5）

2．關(guān)系運(yùn)算

A LIKE B：LIKE比較，如果字符串A符合表達(dá)式B 的正則語法，則為TRUE

A RLIKE B：JAVA的LIKE操作，如果字符串A符合JAVA正則表達(dá)式B的正則語法，則為TRUE

A REGEXP B：功能與RLIKE相同

3．數(shù)學(xué)運(yùn)算

支持所有數(shù)值類型：加（＋）、減（－）、乘（＊）、除（／）、取余（％）、位與（＆）、位或（｜）、位異或（＾）、位取反（～）

4．邏輯運(yùn)算

支持：邏輯與（and）、邏輯或（or）、邏輯非（not）

5．數(shù)值運(yùn)算

取整函數(shù)：round（double a）

指定精度取整函數(shù)：round（double a， int d）

向下取整函數(shù)：floor（double a）

向上取整函數(shù)：ceil（double a）

取隨機(jī)數(shù)函數(shù)：rand（），rand（int seed）

自然指數(shù)函數(shù)：exp（double a）

以10為底對(duì)數(shù)函數(shù)：log10（double a）

以2為底對(duì)數(shù)函數(shù)：log2（）

對(duì)數(shù)函數(shù)：log（）

冪運(yùn)算函數(shù)：pow（double a， double p）

開平方函數(shù)：sqrt（double a）

二進(jìn)制函數(shù)：bin（BIGINT a）

十六進(jìn)制函數(shù)：hex（）

絕對(duì)值函數(shù)：abs（）

正取余函數(shù)：pmod（）

6．條件函數(shù)

case when

coalesce（c1，c2，c3）

nvl（c1，c2）

7．日期函數(shù)

獲得當(dāng)前時(shí)區(qū)的UNIX時(shí)間戳： unix＿timestamp（）

時(shí)間戳轉(zhuǎn)日期函數(shù)：from＿unixtime（）

日期轉(zhuǎn)時(shí)間戳：unix＿timestamp（string date）

日期時(shí)間轉(zhuǎn)日期函數(shù)：to＿date（string timestamp）

日期轉(zhuǎn)年函數(shù)：year（string date）

日期轉(zhuǎn)月函數(shù)：month （string date）

日期轉(zhuǎn)天函數(shù)： day （string date）

日期轉(zhuǎn)小時(shí)函數(shù)： hour （string date）

日期轉(zhuǎn)分鐘函數(shù)：minute （string date）

日期轉(zhuǎn)秒函數(shù)： second （string date）

日期轉(zhuǎn)周函數(shù)： weekofyear （string date）

日期比較函數(shù)： datediff（string enddate， string startdate）

日期增加函數(shù)： date＿add（string startdate， int days）

日期減少函數(shù)：date＿sub （string startdate， int days）

8．字符串函數(shù)

字符串長(zhǎng)度函數(shù)：length（string A）

字符串反轉(zhuǎn)函數(shù)：reverse（string A）

字符串連接函數(shù)： concat（string A， string B…）

帶分隔符字符串連接函數(shù)：concat＿ws（string SEP， string A， string B…）

字符串截取函數(shù)： substr（string A， int start， int len）

字符串轉(zhuǎn)大寫函數(shù)： upper（string A）

字符串轉(zhuǎn)小寫函數(shù)：lower（string A）

去空格函數(shù)：trim（string A）

左邊去空格函數(shù)：ltrim（string A）

右邊去空格函數(shù)：rtrim（string A）

正則表達(dá)式替換函數(shù)：regexp＿replace（string A， string B， string C）

正則表達(dá)式解析函數(shù)： regexp＿extract（string subject， string pattern， int index）

URL解析函數(shù)：parse＿url（string urlString， string partToExtract ［， string keyToExtract］）返回值： string

json解析函數(shù)：get＿json＿object（string json＿string， string path）

空格字符串函數(shù)：space（int n）

重復(fù)字符串函數(shù)：repeat（string str， int n）

首字符ascii函數(shù)：ascii（string str）

左補(bǔ)足函數(shù)：lpad（string str， int len， string pad）

右補(bǔ)足函數(shù)：rpad（string str， int len， string pad）

分割字符串函數(shù)： split（string str， string pat）

集合查找函數(shù)： find＿in＿set（string str， string strList）

9．窗口函數(shù)

分組求和函數(shù)：sum（pv） over（partition by cookieid order by createtime）有坑，加不加 order by 差別很大，具體詳情在下面第二部分。

分組內(nèi)排序，從1開始順序排：ROW＿NUMBER（）如：1234567

分組內(nèi)排序，排名相等會(huì)在名次中留下空位：RANK（）如：1233567

分組內(nèi)排序，排名相等不會(huì)在名次中留下空位：DENSE＿RANK（）如：1233456

有序的數(shù)據(jù)集合平均分配到指定的數(shù)量（num）個(gè)桶中：NTILE（）

統(tǒng)計(jì)窗口內(nèi)往上第n行值：LAG（col，n，DEFAULT）

統(tǒng)計(jì)窗口內(nèi)往下第n行值：LEAD（col，n，DEFAULT）

分組內(nèi)排序后，截止到當(dāng)前行，第一個(gè)值：FIRST＿VALUE（col）

分組內(nèi)排序后，截止到當(dāng)前行，最后一個(gè)值： LAST＿VALUE（col）

小于等于當(dāng)前值的行數(shù)／分組內(nèi)總行數(shù)：CUME＿DIST（）

以下函數(shù)建議看第二部分詳細(xì)理解下，此處僅簡(jiǎn)寫，！

將多個(gè)group by 邏輯寫在一個(gè)sql語句中： GROUPING SETS

根據(jù)GROUP BY的維度的所有組合進(jìn)行聚合：CUBE

CUBE的子集，以最左側(cè)的維度為主，從該維度進(jìn)行層級(jí)聚合：ROLLUP

第二部分1．對(duì)數(shù)據(jù)庫(kù)的操作創(chuàng)建數(shù)據(jù)庫(kù)：create database if not exists myhive；

說明：hive的表存放位置模式是由hive－site．xml當(dāng)中的一個(gè)屬性指定的：hive．metastore．warehouse．dir

創(chuàng)建數(shù)據(jù)庫(kù)并指定hdfs存儲(chǔ)位置：

create database myhive2 location ＇／myhive2＇；

修改數(shù)據(jù)庫(kù)：alter database myhive2 set dbproperties（＇createtime＇＝＇20210329＇）；

說明：可以使用alter database 命令來修改數(shù)據(jù)庫(kù)的一些屬性。但是數(shù)據(jù)庫(kù)的元數(shù)據(jù)信息是不可更改的，包括數(shù)據(jù)庫(kù)的名稱以及數(shù)據(jù)庫(kù)所在的位置

查看數(shù)據(jù)庫(kù)詳細(xì)信息查看數(shù)據(jù)庫(kù)基本信息

hive （myhive）＞ desc database myhive2；

查看數(shù)據(jù)庫(kù)更多詳細(xì)信息

hive （myhive）＞ desc database extended myhive2；

刪除數(shù)據(jù)庫(kù)刪除一個(gè)空數(shù)據(jù)庫(kù)，如果數(shù)據(jù)庫(kù)下面有數(shù)據(jù)表，那么就會(huì)報(bào)錯(cuò)

drop database myhive2；

強(qiáng)制刪除數(shù)據(jù)庫(kù)，包含數(shù)據(jù)庫(kù)下面的表一起刪除

drop database myhive cascade；

圖片標(biāo)題

注意：

如果使用 group by 分組，則 select 后面只能寫分組的字段或者聚合函數(shù)

where和having區(qū)別：

1 having是在 group by 分完組之后再對(duì)數(shù)據(jù)進(jìn)行篩選，所以having 要篩選的字段只能是分組字段或者聚合函數(shù)

2 where 是從數(shù)據(jù)表中的字段直接進(jìn)行的篩選的，所以不能跟在gruopby后面，也不能使用聚合函數(shù)

join 連接INNER JOIN 內(nèi)連接：只有進(jìn)行連接的兩個(gè)表中都存在與連接條件相匹配的數(shù)據(jù)才會(huì)被保留下來

select ＊ from techer t ［inner］ join course c on t．t＿id ＝ c．t＿id；－－ inner 可省略

LEFT OUTER JOIN 左外連接：左邊所有數(shù)據(jù)會(huì)被返回，右邊符合條件的被返回

select ＊ from techer t left join course c on t．t＿id ＝ c．t＿id；－－ outer可省略

RIGHT OUTER JOIN 右外連接：右邊所有數(shù)據(jù)會(huì)被返回，左邊符合條件的被返回、

select ＊ from techer t right join course c on t．t＿id ＝ c．t＿id；

FULL OUTER JOIN 滿外（全外）連接：將會(huì)返回所有表中符合條件的所有記錄。如果任一表的指定字段沒有符合條件的值的話，那

就使用NULL值替代。

SELECT ＊ FROM techer t FULL JOIN course c ON t．t＿id ＝ c．t＿id ；

注：1． hive2版本已經(jīng)支持不等值連接，就是 join on條件后面可以使用大于小于符號(hào)了；并且也支持 join on 條件后跟or （早前版本 on 后只支持＝和 and，不支持＞＜和 or）

2．如hive執(zhí)行引擎使用MapReduce，一個(gè)join就會(huì)啟動(dòng)一個(gè)job，一條sql語句中如有多個(gè)join，則會(huì)啟動(dòng)多個(gè)job

注意：表之間用逗號(hào)（，）連接和 inner join 是一樣的

select ＊ from table＿a，table＿b where table＿a．id＝table＿b．id；

它們的執(zhí)行效率沒有區(qū)別，只是書寫方式不同，用逗號(hào)是sql 89標(biāo)準(zhǔn)，join 是sql 92標(biāo)準(zhǔn)。用逗號(hào)連接后面過濾條件用 where ，用 join 連接后面過濾條件是 on。

order by 排序全局排序，只會(huì)有一個(gè)reduce

ASC（ascend）：升序（默認(rèn)） DESC（descend）：降序

SELECT ＊ FROM student s LEFT JOIN score sco ON s．s＿id ＝ sco．s＿id ORDER BY sco．s＿score DESC；

注意：order by 是全局排序，所以最后只有一個(gè)reduce，也就是在一個(gè)節(jié)點(diǎn)執(zhí)行，如果數(shù)據(jù)量太大，就會(huì)耗費(fèi)較長(zhǎng)時(shí)間

sort by 局部排序每個(gè)MapReduce內(nèi)部進(jìn)行排序，對(duì)全局結(jié)果集來說不是排序。

設(shè)置reduce個(gè)數(shù)

set mapreduce．job．reduces＝3；

查看設(shè)置reduce個(gè)數(shù)

set mapreduce．job．reduces；

查詢成績(jī)按照成績(jī)降序排列

select ＊ from score sort by s＿score；

將查詢結(jié)果導(dǎo)入到文件中（按照成績(jī)降序排列）

insert overwrite local directory ＇／export／servers／hivedatas／sort＇ select ＊ from score sort by s＿score；

distribute by 分區(qū)排序distribute by：類似MR中partition，進(jìn)行分區(qū)，結(jié)合sort by使用

設(shè)置reduce的個(gè)數(shù)，將我們對(duì)應(yīng)的s＿id劃分到對(duì)應(yīng)的reduce當(dāng)中去

set mapreduce．job．reduces＝7；

通過distribute by 進(jìn)行數(shù)據(jù)的分區(qū)

＊ from score distribute by s＿id sort by s＿score；

注意：Hive要求 distribute by 語句要寫在 sort by 語句之前

cluster by當(dāng)distribute by和sort by字段相同時(shí)，可以使用cluster by方式．

cluster by除了具有distribute by的功能外還兼具sort by的功能。但是排序只能是正序排序，不能指定排序規(guī)則為ASC或者DESC。

以下兩種寫法等價(jià)

select ＊ from score cluster by s＿id；

select ＊ from score distribute by s＿id sort by s＿id；

Hive函數(shù)聚合函數(shù)hive支持 count（），max（），min（），sum（），avg（）等常用的聚合函數(shù)

注意：

聚合操作時(shí)要注意null值

count（＊）包含null值，統(tǒng)計(jì)所有行數(shù)

count（id）不包含null值

min 求最小值是不包含null，除非所有值都是null

avg 求平均值也是不包含null

非空集合總體變量函數(shù)： var＿pop語法： var＿pop（col）

返回值： double

說明：統(tǒng)計(jì)結(jié)果集中col非空集合的總體變量（忽略null）

非空集合樣本變量函數(shù)： var＿samp語法： var＿samp （col）

返回值： double

說明：統(tǒng)計(jì)結(jié)果集中col非空集合的樣本變量（忽略null）

總體標(biāo)準(zhǔn)偏離函數(shù)： stddev＿pop語法： stddev＿pop（col）

返回值： double

說明：該函數(shù)計(jì)算總體標(biāo)準(zhǔn)偏離，并返回總體變量的平方根，其返回值與VAR＿POP函數(shù)的平方根相同

中位數(shù)函數(shù)： percentile語法： percentile（BIGINT col， p）

返回值： double

說明：求準(zhǔn)確的第pth個(gè)百分位數(shù)，p必須介于0和1之間，但是col字段目前只支持整數(shù)，不支持浮點(diǎn)數(shù)類型

關(guān)系運(yùn)算支持：等值（＝）、不等值（�。� 或＜＞）、小于（＜）、小于等于（＜＝）、大于（＞）、大于等于（＞＝）

空值判斷（is null）、非空判斷（is not null）

LIKE比較： LIKE語法： A LIKE B

操作類型： strings

描述：如果字符串A或者字符串B為NULL，則返回NULL；如果字符串A符合表達(dá)式B 的正則語法，則為TRUE；否則為FALSE。B中字符”＿”表示任意單個(gè)字符，而字符”％”表示任意數(shù)量的字符。

JAVA的LIKE操作： RLIKE語法： A RLIKE B

操作類型： strings

描述：如果字符串A或者字符串B為NULL，則返回NULL；如果字符串A符合JAVA正則表達(dá)式B的正則語法，則為TRUE；否則為FALSE。

REGEXP操作： REGEXP語法： A REGEXP B

操作類型： strings

描述：功能與RLIKE相同

示例：select 1 from tableName where ＇footbar＇ REGEXP ＇＾f．＊r＄＇；

結(jié)果：1

數(shù)學(xué)運(yùn)算支持所有數(shù)值類型：加（＋）、減（－）、乘（＊）、除（／）、取余（％）、位與（＆）、位或（｜）、位異或（＾）、位取反（～）

邏輯運(yùn)算支持：邏輯與（and）、邏輯或（or）、邏輯非（not）

數(shù)值運(yùn)算取整函數(shù)： round語法： round（double a）

返回值： BIGINT

說明：返回double類型的整數(shù)值部分（遵循四舍五入）

示例：select round（3．1415926） from tableName；

結(jié)果：3

指定精度取整函數(shù)： round語法： round（double a， int d）

返回值： DOUBLE

說明：返回指定精度d的double類型

hive＞ select round（3．1415926，4） from tableName；

3．1416

向下取整函數(shù)： floor語法： floor（double a）

返回值： BIGINT

說明：返回等于或者小于該double變量的最大的整數(shù)

hive＞ select floor（3．641） from tableName；

向上取整函數(shù)： ceil語法： ceil（double a）

返回值： BIGINT

說明：返回等于或者大于該double變量的最小的整數(shù)

hive＞ select ceil（3．1415926） from tableName；

取隨機(jī)數(shù)函數(shù)： rand語法： rand（），rand（int seed）

返回值： double

說明：返回一個(gè)0到1范圍內(nèi)的隨機(jī)數(shù)。如果指定種子seed，則會(huì)等到一個(gè)穩(wěn)定的隨機(jī)數(shù)序列

hive＞ select rand（） from tableName；－－每次執(zhí)行此語句得到的結(jié)果都不同

0．5577432776034763

hive＞ select rand（100）；－－只要指定種子，每次執(zhí)行此語句得到的結(jié)果一樣的

0．7220096548596434

自然指數(shù)函數(shù)： exp語法： exp（double a）

返回值： double

說明：返回自然對(duì)數(shù)e的a次方

hive＞ select exp（2）；

7．38905609893065

以10為底對(duì)數(shù)函數(shù)： log10語法： log10（double a）

返回值： double

說明：返回以10為底的a的對(duì)數(shù)

hive＞ select log10（100）；

2．0

此外還有：以2為底對(duì)數(shù)函數(shù)： log2（）、對(duì)數(shù)函數(shù)： log（）

冪運(yùn)算函數(shù)： pow語法： pow（double a， double p）

返回值： double

說明：返回a的p次冪

hive＞ select pow（2，4）；

16．0

開平方函數(shù)： sqrt語法： sqrt（double a）

返回值： double

說明：返回a的平方根

hive＞ select sqrt（16）；

4．0

二進(jìn)制函數(shù)： bin語法： bin（BIGINT a）

返回值： string

說明：返回a的二進(jìn)制代碼表示

hive＞ select bin（7）；

111

圖片標(biāo)題

hive （hive＿explode）＞ select get＿json＿object（concat（＇｛＇，sale＿info＿1，＇｝＇），＇＄．source＇） as source，get＿json＿object（concat（＇｛＇，sale＿info＿1，＇｝＇），＇＄．monthSales＇） as monthSales，get＿json＿object（concat（＇｛＇，sale＿info＿1，＇｝＇），＇＄．userCount＇） as monthSales，get＿json＿object（concat（＇｛＇，sale＿info＿1，＇｝＇），＇＄．score＇） as monthSales from explode＿lateral＿view LATERAL VIEW explode（split（regexp＿replace（regexp＿replace（sale＿info，＇＼［＼｛＇，＇＇），＇｝］＇，＇＇），＇｝，＼｛＇））sale＿info as sale＿info＿1；

總結(jié)：

Lateral View通常和UDTF一起出現(xiàn)，為了解決UDTF不允許在select字段的問題。Multiple Lateral View可以實(shí)現(xiàn)類似笛卡爾乘積。Outer關(guān)鍵字可以把不輸出的UDTF的空結(jié)果，輸出成NULL，防止丟失數(shù)據(jù)。

行轉(zhuǎn)列

相關(guān)參數(shù)說明：

CONCAT（string A／col， string B／col…）：返回輸入字符串連接后的結(jié)果，支持任意個(gè)輸入字符串；

CONCAT＿WS（separator， str1， str2，．．．）：它是一個(gè)特殊形式的 CONCAT（）。第一個(gè)參數(shù)剩余參數(shù)間的分隔符。分隔符可以是與剩余參數(shù)一樣的字符串。如果分隔符是 NULL，返回值也將為 NULL。這個(gè)函數(shù)會(huì)跳過分隔符參數(shù)后的任何 NULL 和空字符串。分隔符將被加到被連接的字符串之間；

COLLECT＿SET（col）：函數(shù)只接受基本數(shù)據(jù)類型，它的主要作用是將某字段的值進(jìn)行去重匯總，產(chǎn)生array類型字段。

數(shù)據(jù)準(zhǔn)備：

nameconstellationblood＿type孫悟空白羊座A老王射手座A宋宋白羊座B豬八戒白羊座A鳳姐射手座A

需求：把星座和血型一樣的人歸類到一起。結(jié)果如下：

射手座，A 老王｜鳳姐

白羊座，A 孫悟空｜豬八戒

白羊座，B 宋宋

實(shí)現(xiàn)步驟：

創(chuàng)建本地constellation．txt，導(dǎo)入數(shù)據(jù)node03服務(wù)器執(zhí)行以下命令創(chuàng)建文件，注意數(shù)據(jù)使用進(jìn)行分割

cd ／export／servers／hivedatas

vim constellation．txt

數(shù)據(jù)如下：

孫悟空白羊座 A

老王射手座 A

宋宋白羊座 B

豬八戒白羊座 A

鳳姐射手座 A

創(chuàng)建hive表并導(dǎo)入數(shù)據(jù)創(chuàng)建hive表并加載數(shù)據(jù)

hive （hive＿explode）＞ create table person＿info（

name string，

constellation string，

blood＿type string）

row format delimited fields terminated by ＂＂；

加載數(shù)據(jù)

hive （hive＿explode）＞ load data local inpath ＇／export／servers／hivedatas／constellation．txt＇ into table person＿info；

按需求查詢數(shù)據(jù)hive （hive＿explode）＞ select

t1．base，
concat＿ws（＇｜＇， collect＿set（t1．name）） name
from
（select
name，
concat（constellation，＂，＂， blood＿type） base
from
person＿info） t1
group by
t1．base；

列轉(zhuǎn)行

所需函數(shù)：

EXPLODE（col）：將hive一列中復(fù)雜的array或者map結(jié)構(gòu)拆分成多行。

LATERAL VIEW

用法：LATERAL VIEW udtf（expression） tableAlias AS columnAlias

解釋：用于和split， explode等UDTF一起使用，它能夠?qū)⒁涣袛?shù)據(jù)拆成多行數(shù)據(jù)，在此基礎(chǔ)上可以對(duì)拆分后的數(shù)據(jù)進(jìn)行聚合。

數(shù)據(jù)準(zhǔn)備：

cd ／export／servers／hivedatas

vim movie．txt

文件內(nèi)容如下：數(shù)據(jù)字段之間使用進(jìn)行分割

《疑犯追蹤》懸疑，動(dòng)作，科幻，劇情

《Lie to me》懸疑，警匪，動(dòng)作，心理，劇情

《戰(zhàn)狼2》戰(zhàn)爭(zhēng)，動(dòng)作，災(zāi)難

需求：將電影分類中的數(shù)組數(shù)據(jù)展開。結(jié)果如下：

《疑犯追蹤》懸疑

《疑犯追蹤》動(dòng)作

《疑犯追蹤》科幻

《疑犯追蹤》劇情

《Lie to me》懸疑

《Lie to me》警匪

《Lie to me》動(dòng)作

《Lie to me》心理

《Lie to me》劇情

《戰(zhàn)狼2》戰(zhàn)爭(zhēng)

《戰(zhàn)狼2》動(dòng)作

《戰(zhàn)狼2》災(zāi)難

實(shí)現(xiàn)步驟：

創(chuàng)建hive表create table movie＿info（

movie string，

category array＜string＞）

row format delimited fields terminated by ＂＂

collection items terminated by ＂，＂；

加載數(shù)據(jù)load data local inpath ＂／export／servers／hivedatas／movie．txt＂ into table movie＿info；

按需求查詢數(shù)據(jù)select

movie，
category＿name

from

movie＿info lateral view explode（category） table＿tmp as category＿name；

reflect函數(shù)

reflect函數(shù)可以支持在sql中調(diào)用java中的自帶函數(shù)，秒殺一切udf函數(shù)。

需求1：使用java．lang．Math當(dāng)中的Max求兩列中最大值

實(shí)現(xiàn)步驟：

創(chuàng)建hive表create table test＿udf（col1 int，col2 int） row format delimited fields terminated by ＇，＇；

準(zhǔn)備數(shù)據(jù)并加載數(shù)據(jù)cd ／export／servers／hivedatas

vim test＿udf

文件內(nèi)容如下：

1，2

4，3

6，4

7，5

5，6

加載數(shù)據(jù)hive （hive＿explode）＞ load data local inpath ＇／export／servers／hivedatas／test＿udf＇ overwrite into table test＿udf；

使用java．lang．Math當(dāng)中的Max求兩列當(dāng)中的最大值hive （hive＿explode）＞ select reflect（＂java．lang．Math＂，＂max＂，col1，col2） from test＿udf；

需求2：文件中不同的記錄來執(zhí)行不同的java的內(nèi)置函數(shù)

實(shí)現(xiàn)步驟：

創(chuàng)建hive表hive （hive＿explode）＞ create table test＿udf2（class＿name string，method＿name string，col1 int ， col2 int） row format delimited fields terminated by ＇，＇；

準(zhǔn)備數(shù)據(jù)cd ／export／servers／hivedatas

vim test＿udf2

文件內(nèi)容如下：

java．lang．Math，min，1，2

java．lang．Math，max，2，3

加載數(shù)據(jù)hive （hive＿explode）＞ load data local inpath ＇／export／servers／hivedatas／test＿udf2＇ overwrite into table test＿udf2；

執(zhí)行查詢hive （hive＿explode）＞ select reflect（class＿name，method＿name，col1，col2） from test＿udf2；

需求3：判斷是否為數(shù)字

實(shí)現(xiàn)方式：

使用apache commons中的函數(shù)，commons下的jar已經(jīng)包含在hadoop的classpath中，所以可以直接使用。

select reflect（＂org．a(chǎn)pache．commons．lang．math．NumberUtils＂，＂isNumber＂，＂123＂）
Hive 窗口函數(shù)

窗口函數(shù)最重要的關(guān)鍵字是 partition by 和 order by

具體語法如下：XXX over （partition by xxx order by xxx）

特別注意：over（）里面的 partition by 和 order by 都不是必選的，over（）里面可以只有partition by，也可以只有order by，也可以兩個(gè)都沒有，大家需根據(jù)需求靈活運(yùn)用。

窗口函數(shù)我劃分了幾個(gè)大類，我們一類一類的講解。

1． SUM、AVG、MIN、MAX

講解這幾個(gè)窗口函數(shù)前，先創(chuàng)建一個(gè)表，以實(shí)際例子講解大家更容易理解。

首先創(chuàng)建用戶訪問頁(yè)面表：user＿pv

create table user＿pv（
cookieid string，－－用戶登錄的cookie，即用戶標(biāo)識(shí)
createtime string，－－日期
pv int －－頁(yè)面訪問量
）；

給上面這個(gè)表加上如下數(shù)據(jù)：

cookie1，2021－05－10，1

cookie1，2021－05－11，5

cookie1，2021－05－12，7

cookie1，2021－05－13，3

cookie1，2021－05－14，2

cookie1，2021－05－15，4

cookie1，2021－05－16，4

SUM（）使用

執(zhí)行如下查詢語句：

select cookieid，createtime，pv，

sum（pv） over（partition by cookieid order by createtime） as pv1

from user＿pv；

結(jié)果如下：（因命令行原因，下圖字段名和值是錯(cuò)位的，請(qǐng)注意辨別�。�

執(zhí)行如下查詢語句：

select cookieid，createtime，pv，

sum（pv） over（partition by cookieid ） as pv1

from user＿pv；

結(jié)果如下：

第一條SQL的over（）里面加 order by ，第二條SQL沒加order by ，結(jié)果差別很大

所以要注意了：

over（）里面加 order by 表示：分組內(nèi)從起點(diǎn)到當(dāng)前行的pv累積，如，11號(hào)的pv1＝10號(hào)的pv＋11號(hào)的pv， 12號(hào)＝10號(hào)＋11號(hào)＋12號(hào)；

over（）里面不加 order by 表示：將分組內(nèi)所有值累加。

AVG，MIN，MAX，和SUM用法一樣，這里就不展開講了，但是要注意 AVG，MIN，MAX 的over（）里面加不加 order by 也和SUM一樣，如 AVG 求平均值，如果加上 order by，表示分組內(nèi)從起點(diǎn)到當(dāng)前行的平局值，不是全部的平局值。MIN，MAX 同理。

2． ROW＿NUMBER、RANK、DENSE＿RANK、NTILE

還是用上述的用戶登錄日志表：user＿pv，里面的數(shù)據(jù)換成如下所示：

cookie1，2021－05－10，1

cookie1，2021－05－11，5
cookie1，2021－05－12，7
cookie1，2021－05－13，3
cookie1，2021－05－14，2
cookie1，2021－05－15，4
cookie1，2021－05－16，4
cookie2，2021－05－10，2
cookie2，2021－05－11，3
cookie2，2021－05－12，5
cookie2，2021－05－13，6
cookie2，2021－05－14，3
cookie2，2021－05－15，9
cookie2，2021－05－16，7
ROW＿NUMBER（）使用：

ROW＿NUMBER（）從1開始，按照順序，生成分組內(nèi)記錄的序列。

SELECT
cookieid，
createtime，
pv，
ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY pv desc） AS rn
FROM user＿pv；

結(jié)果如下：

RANK 和 DENSE＿RANK 使用：

RANK（）生成數(shù)據(jù)項(xiàng)在分組中的排名，排名相等會(huì)在名次中留下空位。

DENSE＿RANK（）生成數(shù)據(jù)項(xiàng)在分組中的排名，排名相等會(huì)在名次中不會(huì)留下空位。

SELECT
cookieid，
createtime，
pv，
RANK（） OVER（PARTITION BY cookieid ORDER BY pv desc） AS rn1，
DENSE＿RANK（） OVER（PARTITION BY cookieid ORDER BY pv desc） AS rn2，
ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY pv DESC） AS rn3
FROM user＿pv
WHERE cookieid ＝＇cookie1＇；

結(jié)果如下：

NTILE的使用：

有時(shí)會(huì)有這樣的需求：如果數(shù)據(jù)排序后分為三部分，業(yè)務(wù)人員只關(guān)心其中的一部分，如何將這中間的三分之一數(shù)據(jù)拿出來呢？NTILE函數(shù)即可以滿足。

ntile可以看成是：把有序的數(shù)據(jù)集合平均分配到指定的數(shù)量（num）個(gè)桶中，將桶號(hào)分配給每一行。如果不能平均分配，則優(yōu)先分配較小編號(hào)的桶，并且各個(gè)桶中能放的行數(shù)最多相差1。

然后可以根據(jù)桶號(hào)，選取前或后 n分之幾的數(shù)據(jù)。數(shù)據(jù)會(huì)完整展示出來，只是給相應(yīng)的數(shù)據(jù)打標(biāo)簽；具體要取幾分之幾的數(shù)據(jù)，需要再嵌套一層根據(jù)標(biāo)簽取出。

SELECT

cookieid，

createtime，

pv，

NTILE（2） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn1，

NTILE（3） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn2，

NTILE（4） OVER（ORDER BY createtime） AS rn3

FROM user＿pv

ORDER BY cookieid，createtime；

結(jié)果如下：

3． LAG、LEAD、FIRST＿VALUE、LAST＿VALUE

講解這幾個(gè)窗口函數(shù)時(shí)還是以實(shí)例講解，首先創(chuàng)建用戶訪問頁(yè)面表：user＿url

CREATE TABLE user＿url （
cookieid string，
createtime string，－－頁(yè)面訪問時(shí)間
url string －－被訪問頁(yè)面
）；

表中加入如下數(shù)據(jù)：

cookie1，2021－06－10 10：00：02，url2

cookie1，2021－06－10 10：00：00，url1

cookie1，2021－06－10 10：03：04，1url3

cookie1，2021－06－10 10：50：05，url6

cookie1，2021－06－10 11：00：00，url7

cookie1，2021－06－10 10：10：00，url4

cookie1，2021－06－10 10：50：01，url5

cookie2，2021－06－10 10：00：02，url22

cookie2，2021－06－10 10：00：00，url11

cookie2，2021－06－10 10：03：04，1url33

cookie2，2021－06－10 10：50：05，url66

cookie2，2021－06－10 11：00：00，url77

cookie2，2021－06－10 10：10：00，url44

cookie2，2021－06－10 10：50：01，url55

LAG的使用：

LAG（col，n，DEFAULT）用于統(tǒng)計(jì)窗口內(nèi)往上第n行值。

第一個(gè)參數(shù)為列名，第二個(gè)參數(shù)為往上第n行（可選，默認(rèn)為1），第三個(gè)參數(shù)為默認(rèn)值（當(dāng)往上第n行為NULL時(shí)候，取默認(rèn)值，如不指定，則為NULL）

SELECT cookieid，

createtime，

url，

ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn，

LAG（createtime，1，＇1970－01－01 00：00：00＇） OVER（PARTITION BY cookieid ORDER BY createtime） AS last＿1

time，

LAG（createtime，2） OVER（PARTITION BY cookieid ORDER BY createtime） AS last＿2＿time

FROM user＿url；

結(jié)果如下：

解釋：

last＿1＿time：指定了往上第1行的值，default為＇1970－01－01 00：00：00＇
cookie1第一行，往上1行為NULL，因此取默認(rèn)值 1970－01－01 00：00：00
cookie1第三行，往上1行值為第二行值，2021－06－10 10：00：02
cookie1第六行，往上1行值為第五行值，2021－06－10 10：50：01

last＿2＿time：指定了往上第2行的值，為指定默認(rèn)值
cookie1第一行，往上2行為NULL
cookie1第二行，往上2行為NULL
cookie1第四行，往上2行為第二行值，2021－06－10 10：00：02
cookie1第七行，往上2行為第五行值，2021－06－10 10：50：01

LEAD的使用：

與LAG相反

LEAD（col，n，DEFAULT）用于統(tǒng)計(jì)窗口內(nèi)往下第n行值。

第一個(gè)參數(shù)為列名，第二個(gè)參數(shù)為往下第n行（可選，默認(rèn)為1），第三個(gè)參數(shù)為默認(rèn)值（當(dāng)往下第n行為NULL時(shí)候，取默認(rèn)值，如不指定，則為NULL）

SELECT cookieid，

createtime，

url，

ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn，

LEAD（createtime，1，＇1970－01－01 00：00：00＇） OVER（PARTITION BY cookieid ORDER BY createtime） AS next＿＿time，

LEAD（createtime，2） OVER（PARTITION BY cookieid ORDER BY createtime） AS next＿2＿time

FROM user＿url；

結(jié)果如下：

FIRST＿VALUE的使用：

取分組內(nèi)排序后，截止到當(dāng)前行，第一個(gè)值。

SELECT cookieid，

createtime，

url，

ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn，

FIRST＿VALUE（url） OVER（PARTITION BY cookieid ORDER BY createtime） AS first1

FROM user＿url；

結(jié)果如下：

LAST＿VALUE的使用：

取分組內(nèi)排序后，截止到當(dāng)前行，最后一個(gè)值。

SELECT cookieid，

createtime，

url，

ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn，

LAST＿VALUE（url） OVER（PARTITION BY cookieid ORDER BY createtime） AS last1

FROM user＿url；

結(jié)果如下：

如果想要取分組內(nèi)排序后最后一個(gè)值，則需要變通一下：

SELECT cookieid，

createtime，

url，

ROW＿NUMBER（） OVER（PARTITION BY cookieid ORDER BY createtime） AS rn，

LAST＿VALUE（url） OVER（PARTITION BY cookieid ORDER BY createtime） AS last1，

FIRST＿VALUE（url） OVER（PARTITION BY cookieid ORDER BY createtime DESC） AS last2

FROM user＿url

ORDER BY cookieid，createtime；

注意上述SQL，使用的是 FIRST＿VALUE 的倒序取出分組內(nèi)排序最后一個(gè)值！

結(jié)果如下：

此處要特別注意order by

如果不指定ORDER BY，則進(jìn)行排序混亂，會(huì)出現(xiàn)錯(cuò)誤的結(jié)果

SELECT cookieid，

createtime，

url，

FIRST＿VALUE（url） OVER（PARTITION BY cookieid） AS first2

FROM user＿url；

結(jié)果如下：

上述 url2 和 url55 的createtime即不屬于最靠前的時(shí)間也不屬于最靠后的時(shí)間，所以結(jié)果是混亂的。

4． CUME＿DIST

先創(chuàng)建一張員工薪水表：staff＿salary

CREATE EXTERNAL TABLE staff＿salary （

dept string，

userid string，

sal int

）；

表中加入如下數(shù)據(jù)：

d1，user1，1000

d1，user2，2000

d1，user3，3000

d2，user4，4000

d2，user5，5000

CUME＿DIST的使用：

此函數(shù)的結(jié)果和order by的排序順序有關(guān)系。

CUME＿DIST：小于等于當(dāng)前值的行數(shù)／分組內(nèi)總行數(shù)。order默認(rèn)順序：正序

比如，統(tǒng)計(jì)小于等于當(dāng)前薪水的人數(shù)，所占總?cè)藬?shù)的比例。

SELECT

dept，

userid，

sal，

CUME＿DIST（） OVER（ORDER BY sal） AS rn1，

CUME＿DIST（） OVER（PARTITION BY dept ORDER BY sal） AS rn2

FROM staff＿salary；

結(jié)果如下：

解釋：

rn1：沒有partition，所有數(shù)據(jù)均為1組，總行數(shù)為5，

第一行：小于等于1000的行數(shù)為1，因此，1／5＝0．2

第三行：小于等于3000的行數(shù)為3，因此，3／5＝0．6

rn2：按照部門分組，dpet＝d1的行數(shù)為3

第二行：小于等于2000的行數(shù)為2，因此，2／3＝0．6666666666666666

5． GROUPING SETS、GROUPING＿＿ID、CUBE、ROLLUP

這幾個(gè)分析函數(shù)通常用于OLAP中，不能累加，而且需要根據(jù)不同維度上鉆和下鉆的指標(biāo)統(tǒng)計(jì)，比如，分小時(shí)、天、月的UV數(shù)。

還是先創(chuàng)建一個(gè)用戶訪問表：user＿date

CREATE TABLE user＿date （
month STRING，
day STRING，
cookieid STRING
）；

表中加入如下數(shù)據(jù)：

2021－03，2021－03－10，cookie1
2021－03，2021－03－10，cookie5
2021－03，2021－03－12，cookie7
2021－04，2021－04－12，cookie3
2021－04，2021－04－13，cookie2
2021－04，2021－04－13，cookie4
2021－04，2021－04－16，cookie4
2021－03，2021－03－10，cookie2
2021－03，2021－03－10，cookie3
2021－04，2021－04－12，cookie5
2021－04，2021－04－13，cookie6
2021－04，2021－04－15，cookie3
2021－04，2021－04－15，cookie2
2021－04，2021－04－16，cookie1
GROUPING SETS的使用：

grouping sets是一種將多個(gè)group by 邏輯寫在一個(gè)sql語句中的便利寫法。

等價(jià)于將不同維度的GROUP BY結(jié)果集進(jìn)行UNION ALL。

SELECT
month，
day，
COUNT（DISTINCT cookieid） AS uv，
GROUPING＿＿ID
FROM user＿date
GROUP BY month，day
GROUPING SETS （month，day）
ORDER BY GROUPING＿＿ID；

注：上述SQL中的GROUPING＿＿ID，是個(gè)關(guān)鍵字，表示結(jié)果屬于哪一個(gè)分組集合，根據(jù)grouping sets中的分組條件month，day，1是代表month，2是代表day。

結(jié)果如下：

上述SQL等價(jià)于：

SELECT month，
NULL as day，
COUNT（DISTINCT cookieid） AS uv，
1 AS GROUPING＿＿ID
FROM user＿date
GROUP BY month
UNION ALL
SELECT NULL as month，
day，
COUNT（DISTINCT cookieid） AS uv，
2 AS GROUPING＿＿ID
FROM user＿date
GROUP BY day；
CUBE的使用：

根據(jù)GROUP BY的維度的所有組合進(jìn)行聚合。

SELECT
month，
day，
COUNT（DISTINCT cookieid） AS uv，
GROUPING＿＿ID
FROM user＿date
GROUP BY month，day
WITH CUBE
ORDER BY GROUPING＿＿ID；

結(jié)果如下：

上述SQL等價(jià)于：

SELECT NULL，NULL，COUNT（DISTINCT cookieid） AS uv，0 AS GROUPING＿＿ID FROM user＿date
UNION ALL
SELECT month，NULL，COUNT（DISTINCT cookieid） AS uv，1 AS GROUPING＿＿ID FROM user＿date GROUP BY month
UNION ALL
SELECT NULL，day，COUNT（DISTINCT cookieid） AS uv，2 AS GROUPING＿＿ID FROM user＿date GROUP BY day
UNION ALL
SELECT month，day，COUNT（DISTINCT cookieid） AS uv，3 AS GROUPING＿＿ID FROM user＿date GROUP BY month，day；
ROLLUP的使用：

是CUBE的子集，以最左側(cè)的維度為主，從該維度進(jìn)行層級(jí)聚合。

比如，以month維度進(jìn)行層級(jí)聚合：

SELECT
month，
day，
COUNT（DISTINCT cookieid） AS uv，
GROUPING＿＿ID
FROM user＿date
GROUP BY month，day
WITH ROLLUP
ORDER BY GROUPING＿＿ID；

結(jié)果如下：