Run Parallel Grep For Faster Search
You have a huge file (on the order of terabytes) and you want to search for a string and return the matching lines. As you know, grep reads the file and searches it line by line in a single process, which is not efficient enough at this scale: a single job like this may take hours or even days to finish.
One way to improve this is to split the huge file into several parts and search each part simultaneously.
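To make the idea concrete, here is a minimal sketch that does the splitting by hand with split and background jobs. The file name sample.txt, the chunk_ prefix, and the search string are placeholders made up for the demo, and split -n l/4 assumes GNU coreutils.

```shell
# Work in a scratch directory so the demo leaves nothing behind in your cwd.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small stand-in for the huge file: 1000 lines, every 10th one matches.
i=1
while [ "$i" -le 1000 ]; do
  if [ $((i % 10)) -eq 0 ]; then
    echo "line $i search-string"
  else
    echo "line $i filler"
  fi
  i=$((i + 1))
done > sample.txt

# Split into 4 parts without cutting any line in half (GNU split).
split -n l/4 sample.txt chunk_

# Search every part at the same time, one background grep per part.
for part in chunk_*; do
  grep 'search-string' "$part" > "$part.out" &
done
wait  # block until all background greps have finished

# Stitch the per-part results together, in part order.
cat chunk_*.out > matches.txt

manual_count=$(wc -l < matches.txt)
serial_count=$(grep -c 'search-string' sample.txt)
echo "manual: $manual_count, plain grep: $serial_count"
```

This works, but you have to manage the temporary parts and the background jobs yourself, and the split step writes a full copy of the data to disk before any search starts.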
Fortunately, you can do this in a single line by combining a few standard Linux commands.
A Linux executable called parallel (GNU parallel, which you may need to install from your package manager) takes care of splitting the input, so you do not have to pipe through split or any other extra command.
Here is the one-line shell implementation of parallel grep:
cat HugeFile.txt | parallel -j24 --pipe --block 2000M --cat grep 'search-string' {}
Now let me go step by step and explain how things work, for those who are interested.
cat HugeFile.txt |
The cat command simply reads the file and writes it to standard output, which the pipe feeds to the next command.
parallel
The parallel command runs programs in parallel, grep in this case.
parallel takes a lot of arguments (parameters/flags), all of which you can find in its manual page, but I'd like to point you to a website that I find very useful and that has some amazing tutorials as well. It helped me a great deal in learning to apply grep more easily and quickly.
The -j flag in the code above is the maximum number of jobs to run at the same time, typically one per processor core. I chose 24, but you can change it depending on your computer specs. --pipe tells parallel to split its standard input into blocks and hand each block to a separate job. --block is the size of each block. You can pick anything, but the program recommends keeping it below 2 GB, for reasons I am not sure of. Lastly, --cat tells parallel to write each block to a temporary file and pass that file's name to the command in place of {}.
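As a sanity check on the flags above, the sketch below runs the same command on a small generated file and compares the number of matches against plain grep. It assumes GNU parallel is installed and skips the comparison otherwise. The file name, -j4, and the tiny --block 10K are values picked only so that the demo input is cut into several blocks.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 5000 numbered lines; the ones ending in 7 get the string we search for.
seq 1 5000 | sed 's/7$/7 search-string/' > sample.txt
serial_count=$(grep -c 'search-string' sample.txt)

# Some systems ship a different tool named "parallel" (moreutils), so check for GNU.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # -j4         : run at most 4 grep jobs at once
  # --pipe      : cut stdin into blocks on line boundaries
  # --block 10K : aim for roughly 10 KiB per block (tiny, demo only)
  # --cat       : save each block to a temp file and put its name in {}
  parallel_count=$(cat sample.txt |
    parallel -j4 --pipe --block 10K --cat grep 'search-string' {} |
    wc -l)
  echo "parallel: $parallel_count, plain grep: $serial_count"
else
  echo "GNU parallel not found, skipping the comparison"
fi
```

If the two counts agree on a small file like this, the same flags with a larger --block and -j should behave the same way on the real input.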
grep
After that, you add the program you want to run in parallel, which is grep in this case. You provide the search string right after grep, followed by {}, which stands for the current block (the temporary file that --cat creates).
With a similar technique, you can run any executable in parallel.
I will write another article to show the implementation of the same code on a cluster.
Please let me know if you tested this and/or found this article helpful by submitting your comment below. Feel free to ask any questions you may have as well.
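For instance, swapping grep for wc gives a parallel line count. This is only a sketch under the same assumptions as before (GNU parallel installed, a made-up sample.txt, a tiny block size chosen for the demo).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 5000 > sample.txt

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # Each job counts the lines of its own block; awk adds up the per-block counts.
  # The quotes keep "<" inside the job instead of applying it to parallel itself.
  total=$(cat sample.txt |
    parallel -j4 --pipe --block 10K --cat 'wc -l < {}' |
    awk '{ sum += $1 } END { print sum }')
  echo "lines counted in parallel: $total"
else
  echo "GNU parallel not found, skipping"
fi
```

The pattern is the same each time: the per-block command goes after the flags, and a final step (here awk) combines the per-block results.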
Is there a way to save the output to one file?
I think you should be able to redirect the stdout to a single file like this:
>>file.out
Give it a try and let me know…
Ok, so if I want to search with a pattern file and save the result to a file, will it look like this?
cat HugeFile.txt | parallel -j24 --pipe --block 2000M --cat grep -f pattern-file {} >> result
Yep, that's what I'd do.
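For anyone who wants to try the pattern-file variant from this thread on a small input first, here is a rough sketch. The id- lines and the two patterns are invented for the demo, and it again assumes GNU parallel is installed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: 5000 lines of the form id-N.
seq 1 5000 | sed 's/^/id-/' > sample.txt
# One pattern per line; grep -f prints a line if any of the patterns matches it.
printf '%s\n' 'id-42$' 'id-4999$' > pattern-file

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # ">" starts a fresh result file; ">>" would append to an existing one instead.
  cat sample.txt |
    parallel -j4 --pipe --block 10K --cat grep -f pattern-file {} > result
  cat result
else
  echo "GNU parallel not found, skipping"
fi
```

The matches arrive in whatever order the jobs finish, so the result file is not guaranteed to follow the order of the input. Adding -k (--keep-order) to parallel keeps the output in input order if that matters.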