Awk Tips

by Galoisplusplus - 四 10 4月 2014
Tags #tech #CS #awk #bash #shell #脚本 #linux #Linux #UNIX

近期做毕设实验用了大量自动化脚本,借机做点awk的记录,其实都没神马技术含量,比较基础,大神们轻拍~

统计相关

最大最小值

1
awk 'NR == 1 { min = max = $1 }{ if ($1 > max) { max = $1; } if ($1 < min) { min = $1; } print min, max }'

总和、平均值

1
awk '{ sum = 0; for(i = 1; i <= NF; i++) { sum += $i; } avg = sum / NF; print sum, avg }'

中位数

中位数是其中较为麻烦的一项。首先需要把数据进行排序。如果数据的数目是奇数,那么中位数就是中间的那个数据;否则则是中间两个数据的算术平均值。

这里先假设数组已经排好序,那么我们可以用以下的awk脚本来处理:

1
awk '{ a[NR] = $1; } END { if (NR % 2) { print a[(NR + 1) / 2]; } else { print (a[NR/2] + a[NR/2 + 1]) / 2; } }'

对数组排序的部分我是用sort来做的,当然也可以用awkasort函数。

下面举一个实际的例子:以下是Table-100time文件,需要求每一行的以上各项统计值。其中第一列是行号,不列入统计。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
1 1099.603271 625.161804 610.000183 626.810974 1123.747559 1108.142944 1118.393188 1097.084351 1115.050903 1120.240723 1105.154419 1113.885498 1113.874878 1113.099365 1111.149902 1123.582397 1114.776978 1101.797852 1113.121094 1077.672852 1122.087891 1124.917480 1113.838135 1107.574707 1112.188232 1122.553223 1121.828857 1113.358521 1111.792603 1122.107422 1106.260498 1122.025024 612.053528 575.379456 893.525818 1130.349243 614.986938 886.635559 895.587341 1115.676270 1113.105835 1116.982178 1113.773071 1115.756714 1103.920288 1115.773682 1071.231445 1115.358765 1122.167725 1126.683228 1117.181885 1118.453857 1120.608521 1110.655640 1110.968872 1110.284302 1108.375366 1112.826660 1118.803589 1116.816406 1120.060181 1110.062378 1109.777466 1104.469482 1109.132446 1097.460815 1118.085205 1114.240479 1115.615967 1100.754150 1105.984985 1112.360474 613.934448 1113.597
046 1122.492676 1108.549683 978.940369 831.665161 1123.176147 1146.721802 1156.943604 636.618469 645.765564 1153.194214 589.766296 1110.810913 1108.724976 1105.940796 1121.304443 1116.116577 992.231262 1150.437744 1108.551025 869.626648 1113.281372 574.230530 1111.740479 1111.650635 1105.286011 1114.131470
2 0.063136 0.015616 0.060320 0.014816 0.060768 0.060000 0.060896 0.061248 0.061120 0.059712 0.015968 0.062656 0.062208 0.061568 0.061632 0.061408 0.062048 0.062656 0.062112 0.069280 0.061728 0.063200 0.062112 0.060736 0.061056 0.062944 0.067616 0.060544 0.061152 0.060928 0.068000 0.061664 0.069120 0.063232 0.059808 0.059648 0.060512 0.014240 0.064032 0.060288 0.062144 0.060896 0.069280 0.061920 0.060992 0.061472 0.058848 0.059872 0.071296 0.062432 0.062592 0.061888 0.062848 0.060416 0.061440 0.066528 0.062944 0.059904 0.061856 0.065088 0.060064 0.061440 0.060288 0.061920 0.059936 0.061984 0.061024 0.061728 0.062048 0.060896 0.062336 0.060896 0.062240 0.062112 0.061600 0.059744 0.066912 0.014688 0.060512 0.062976 0.063008 0.070976 0.015936 0.063616 0.054592 0.060704 0.060832 0.061344 0.059968 0.060640 0.051264 0.062656 0.062400 0.059872 0.063648 0.056448 0.060768 0
.060576 0.061728 0.063040
3 0.028128 0.027616 0.026240 0.024736 0.024544 0.027072 0.026912 0.027328 0.027808 0.027456 0.031616 0.025504 0.026816 0.024672 0.024672 0.027200 0.027168 0.025344 0.025536 0.025888 0.026880 0.024384 0.027968 0.024096 0.027424 0.028544 0.025920 0.027072 0.026976 0.025472 0.029568 0.024768 0.032480 0.027744 0.026592 0.026784 0.024576 0.023328 0.027584 0.026720 0.026624 0.027872 0.034592 0.028640 0.026560 0.026720 0.026496 0.027072 0.025376 0.026368 0.025248 0.027904 0.027104 0.027616 0.027776 0.026912 0.026528 0.026816 0.028032 0.027136 0.027584 0.027232 0.026464 0.029056 0.027456 0.027168 0.027072 0.024576 0.027552 0.025088 0.026976 0.027168 0.026784 0.027424 0.026784 0.026912 0.035392 0.040512 0.028480 0.035488 0.027712 0.032832 0.026944 0.035296 0.024384 0.027552 0.024192 0.026592 0.027296 0.027392 0.029408 0.028160 0.023392 0.027040 0.024800 0.024448 0.027520 0.027488 0.026912 0.028352
4 720.253540 717.565735 715.817139 717.748474 718.543030 715.734741 717.143311 715.751343 715.887451 715.781006 718.229675 715.808960 719.805298 7
19.283142 716.932983 715.766663 715.821899 715.742126 718.466309 718.681396 715.724670 719.399963 715.779358 718.264465 715.671326 718.540466 715.748169 718.249634 716.992737 718.466797 716.194458 717.444519 717.247070 715.825623 717.434021 716.618286 715.681519 715.655029 715.848206 717.192139 718.314575 718.943909 718.412842 717.466309 717.111267 717.610962 715.848877 721.035767 716.860535 717.362793 715.776123 719.968140 715.818481 717.481934 715.704834 719.049866 720.205078 719.389282 717.336487 724.044067 717.548279 717.358459 717.179626 717.550232 718.712952 717.472229 720.543335 716.905273 718.393005 723.254089 715.945129 715.756897 715.725586 720.922485 715.875732 719.713623 715.756836 719.216492 715.723267 718.503784 722.632935 718.873718 715.936157 717.273560 721.847961 717.286560 718.792480 720.974731 719.742310 720.751404 715.751953 717.330139 717.142151 719.528503 718.555725 716.737610 719.197266 718.070374 720.632996 718.734375
5 516.273010 343.065674 656.750610 626.545898 709.740112 677.137573 709.918213 591.302612 698.008240 711.465088 659.250183 591.888367 705.064270 704.222046 656.991394 646.187683 704.989136 598.788391 703.356201 685.694885 719.180054 669.689514 642.316528 641.993225 640.972778 649.164490 716.
288208 708.772888 707.416382 704.347412 586.610229 636.456482 313.203125 377.224243 388.061737 641.782410 544.509583 397.306976 277.899902 546.433
533 639.272949 630.372742 648.397034 640.577515 642.346191 640.025818 299.147125 641.151489 642.511658 639.431702 458.824402 467.759552 465.540985
 520.931091 641.239502 649.059692 651.762390 667.363892 653.447388 655.450989 650.086426 646.288879 644.759949 641.802612 648.356995 637.783386 64
8.014709 642.980835 634.090210 640.974365 637.239380 642.302795 631.172485 595.006836 647.026001 651.397522 554.663391 327.290649 648.838074 691.1
94458 700.736389 334.126831 692.194702 701.258423 544.619934 515.694275 650.980957 647.714172 638.059814 643.856995 514.265930 677.103210 651.1687
01 306.423187 647.868347 495.291565 600.393921 622.958679 642.191162 651.052124
6 720.316650 717.581360 715.877441 717.763306 718.603821 715.794739 717.204224 715.812561 715.948547 715.840698 718.245667 715.871643 719.867493 7
19.344727 716.994629 715.828064 715.883972 715.804810 718.528442 718.750671 715.786377 719.463135 715.841492 718.325195 715.732361 718.603394 715.
815796 718.310181 717.053894 718.527710 716.262451 717.506165 717.316162 715.888855 717.493835 716.677917 715.742004 715.669250 715.912231 717.252
441 718.376709 719.004822 718.482117 717.528198 717.172241 717.672424 715.907715 721.095642 716.931824 717.425232 715.838745 720.030029 715.881348
 717.542358 715.766296 719.116394 720.268005 719.449158 717.398315 724.109131 717.608337 717.419922 717.239929 717.612122 718.772888 717.534241 72
0.604370 716.966980 718.455078 723.315002 716.007446 715.817810 715.787842 720.984619 715.937317 719.773376 715.823730 719.231201 715.783752 718.5
66772 722.695923 718.944702 715.952087 717.337158 721.902527 717.347290 718.853333 721.036072 719.802307 720.812073 715.803223 717.392822 717.2045
29 719.588379 718.619385 716.794067 719.258057 718.130920 720.694702 718.797424
7 1615.904297 968.255066 1266.777100 1253.381592 1833.512207 1785.307617 1828.338257 1688.414307 1813.086914 1831.733276 1764.436279 1705.799316 1
818.966064 1817.346069 1768.166016 1769.797363 1819.793335 1700.611572 1816.502808 1763.393555 1841.294800 1794.631348 1756.182617 1749.592041 175
3.188477 1771.746338 1838.142944 1822.158447 1819.235962 1826.480347 1692.900269 1758.506348 925.289124 952.631470 1281.614136 1772.158447 1159.52
1118 1283.965820 1173.514893 1662.136475 1752.405396 1747.382812 1762.204590 1756.362915 1746.293091 1755.826172 1370.405029 1756.537354 1764.7048
34 1766.141357 1576.031494 1586.241333 1586.176636 1631.614258 1752.236206 1759.370850 1760.164307 1780.217407 1772.279053 1772.294434 1770.174194
 1756.378418 1754.563965 1746.301147 1757.516846 1735.271484 1766.126953 1757.245850 1749.733765 1741.753662 1743.251343 1754.690430 1245.133789 1
708.631348 1769.545410 1759.974121 1533.639160 1158.996338 1772.042725 1837.951782 1857.707764 970.778137 1337.987183 1854.487915 1134.410645 1626
.532715 1759.730103 1753.681641 1759.391602 1760.000977 1506.526611 1827.569092 1759.743164 1176.076904 1761.174561 1069.546631 1712.161865 1734.6
36719 1747.504028 1765.211914
8 2340.071533 1689.605957 1992.927368 1975.017212 2555.832031 2504.885254 2559.110352 2407.793701 2532.718018 2551.314697 2521.514160 2425.311523 2542.544434 2540.422119 2488.926758 2489.232666 2554.735596 2420.110107 2538.753662 2486.242432 2560.750732 2517.833984 2475.721191 2471.649902 2472.583252 2494.048584 2557.614502 2544.040771 2540.066650 2548.778564 2412.888916 2479.761230 1658.584473 1672.889282 2002.672607 2492.464355 1878.801880 2003.776245 1893.851440 2383.073242 2474.485596 2470.081055 2484.501953 2477.659424 2479.066162 2490.251709 2091.606201 2481.179443 2485.382080 2487.158691 2295.618408 2325.068115 2305.705322 2352.801025 2472.260498 2482.116211 2483.993652 2503.430664 2493.295898 2500.223633 2491.596436 2477.452148 2475.478760 2467.430908 2480.087646 2456.450684 2490.367432 2477.877441 2471.899902 2469.067871 2463.063965 2474.175049 1977.751953 2433.125244 2489.358887 2492.018311 2375.827637 1900.094604 2491.512695 2560.443604 2584.575684 1712.179321 2058.241699 2575.792725 1863.588135 2347.553955 2482.250488 2478.283447 2482.996338 2484.401855 2321.362549 2549.145264 2480.642578 1900.880493 2483.422363 1789.904663 2435.025635 2456.335205 2472.076172 2487.675781
9 3513.309782 2274.641538 2958.232285 3932.669376 3786.126188 2849.826237 5700.254664 2816.492436 3832.746619 3702.420506 2870.630259 4930.524634 3626.383660 5457.465817 2911.674854 5826.706631 3063.118128 3468.595578 3017.246258 3891.089669 3560.750380 4003.229298 4297.342606 3008.550484 3794.148238 6804.888446 5513.992633 5023.367652 5088.919598 3093.680407 5314.446895 7401.844282 3076.878939 1882.124608 4415.104581 4005.727143 7425.384389 3842.665568 2190.021208 2748.342060 4561.191134 2963.751309 2996.609632 5380.739438 4874.685792 4581.670467 3519.148784 4109.830963 3950.539100 4080.477292 2624.230874 4931.277735 2882.874944 2807.562003 3849.681022 3160.494618 3747.134412 3917.208341 2737.369608 5956.795812 4988.859466 3666.625523 5128.424549 4919.731474 5295.796236 3325.609938 5754.361423 4226.240290 3721.524204 3620.667546 6138.143649 4914.947320 5018.477593 2904.721095 3774.158545 5614.797046 5422.969243 3521.075405 22676.448882 5749.432579 4491.578183 5113.269873 4966.596379 4017.531402 3255.207807 5896.712478 5836.628444 7583.357894 5566.682519 3725.477136 3611.446564 3415.802801 5165.599688 3288.680153 3752.247671 3340.728295 3451.722113 7003.061383 4741.995198 6129.824958

对于中位数,首先我们先考虑其中一行(例如第一行),先用cut把第一列去掉:

1
cat Table-100time | sed -n '1p' | cut -d " " -f 2- 

这里OT一下,sedcut恰好一个负责选取行,一个负责选取列。 这在脚本中是很好用的,以下我给出它们这种用法的一般形式:

sed选取{beginLine}{endLine}中间的行:

1
sed -n '{beginLine}, {endLine}p'

对应于

1
awk 'NR==2'

1
awk 'FNR==2'

cut选取{beginColumn}{endColumn}中间的列:

1
cut -f {beginColumn}-{endColumn}

之后我们需要用sort来排序,但sort只对行排序,我的方法是通过xargs把分隔符由空格转为换行:

1
cat Table-100time | sed -n '1p' | cut -d " " -f 2- | xargs -n1 echo | sort -n

接下来就可以pipe给awk脚本来处理了:

1
cat Table-100time | sed -n '1p' | cut -d " " -f 2- | xargs -n1 echo | sort -n | awk '{ a[NR] = $1; } END { if (NR % 2) { print a[(NR + 1) / 2]; } else { print (a[NR/2] + a[NR/2 + 1]) / 2; } }'

最后要处理的是Table-100time这种多行的表格

1
cat Table-100time | xargs -L1 -i bash -c $'echo {} | cut -d " " -f 2- | xargs -n1 echo | sort -n | awk \'{ a[NR] = $1; } END { if (NR % 2) { print a[(NR + 1) / 2]; } else { print (a[NR/2] + a[NR/2 + 1]) / 2; } }\''

多文件处理

除了FILENAME以外,在awk中辨别文件主要还有FNRNR这两个内置变量。具体定义可以参考官方文档

1
2
3
4
5
FNR
The current record number in the current file. awk increments FNR each time it reads a new record (see Records). awk resets FNR to zero each time it starts a new input file.

NR
The number of input records awk has processed since the beginning of the program’s execution (see Records). awk increments NR each time it reads a new record.

下面同样举一个具体例子:accurate-datasimulated-data都是以<key> <value>形式组织的数据,二者有着完全一致的key,而且事先排过序,需要求两者所记录的value的差值绝对值。

我用了以下的脚本,其实逻辑很简单,先把第一个文件的key-value pair存到一个hash map中,读到第二个文件相同key的值时便与hash map的值比较求绝对值:

1
2
3
awk 'NR == FNR { a[$1] = $NF }
NR > FNR && a[$1] > $NF { print $1"\t"a[$1]-$NF }
NR > FNR && a[$1] < $NF { print $1"\t"$NF-a[$1] }' accurate-data simulated-data > diff

其他

有次我需要比较hadoop程序的执行时间,由于hadoop job -history输出的log中的时间有不同单位,不便比较,所以我用以下命令做预处理再比较(初始数据存放在time-table中,输出到timeResult):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
cat time-table | sed 's/.*(\(.*\))/\0 \1/' | awk '
#BEGIN{
{
if(match($4,/(.+),(.+)/)){ 
min=gensub(/(.+)mins,(.+)sec/, "\\1", "g", $4)
sec=gensub(/(.+)mins,(.+)sec/, "\\2", "g", $4)
time=min*60+sec
#print min
#print sec
#print time
printf "%s %s %s\n", $1, $2, time >> "timeResult"
}
else
{
sec=gensub(/(.+)sec/, "\\1", "g", $4)
time=sec
#print sec
#print time
printf "%s %s %s\n", $1, $2, time >> "timeResult"
}
}
'

参考资料

[1]The GNU Awk User’s Guide

[2]ABS(Advanced Bash-Scripting Guide)

[3]8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Comments