Parallel Files Processing
Member
Posts: 5,567
Joined: Apr 30 2006
Gold: 1,020.00
Oct 10 2016 11:19am
Hiho peeps,

I have a small issue. I'm writing a C++ program that computes some statistics in parallel from data read from files. Currently it works as follows: I keep a tbb::concurrent_hash_map with a string as the key and a vector of floats as the value, then I run a for loop which adds tasks to a scheduler (24 cores), where every task looks like this:
1) Open a file with statistics, malloc a memory chunk the size of the file, fread the file into it, then close the file.
2) Run a loop which uses strtok_r to tokenize each line of data into a vector of char*.
3) When a given line has been read and the vector of char* prepared, I call an update-statistics function on it, which works on the concurrent map itself. The map lock is per key and the keys are spread out widely, so there shouldn't be much waiting for any given lock to be released. If the key is already present in the map, I update the value (the vector of floats) by merging it with the vector of char* (the line); if it isn't in the map yet, the vector of char* is used to prepare a new row, which goes into the map as the value for that key.
Once the statistics for this row have been updated, I clear the row and carry on with the tokenizer to get the next one.
What is important: for some statistics I have to store not only the aggregate values but also every past value for every key (imagine medians, for example, or counts of unique elements - though for the latter I don't actually need every past value, just the distinct ones, so I'm using boost::unordered_set).
4) When an entire file has been processed, the malloc'd memory chunk is freed and the thread moves on to the next file. (A rough sketch of this pipeline follows below.)
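Schematically, one task looks something like this - a heavily simplified sketch; update_statistics, the key choice (first column) and the delimiters are placeholders for the real code:

```cpp
// Stripped-down sketch of one task. update_statistics, the key choice and
// the delimiters are simplified placeholders for the actual code.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>
#include <tbb/concurrent_hash_map.h>

using StatsMap = tbb::concurrent_hash_map<std::string, std::vector<float>>;

// Merge one tokenized row into the shared map (per-key locking via accessor).
void update_statistics(StatsMap& stats, const std::vector<char*>& row) {
    if (row.empty()) return;
    StatsMap::accessor acc;                 // write lock on this key only
    stats.insert(acc, std::string(row[0])); // key = first column (placeholder)
    for (size_t i = 1; i < row.size(); ++i)
        acc->second.push_back(std::atof(row[i])); // simplified "merge"
}

void process_file(const char* path, StatsMap& stats) {
    // 1) Read the whole file into a malloc'd buffer.
    FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    char* buf = static_cast<char*>(std::malloc(size + 1));
    size_t got = std::fread(buf, 1, size, f);
    buf[got] = '\0';
    std::fclose(f);

    // 2)+3) Tokenize line by line with strtok_r and update the map.
    char* save_line = nullptr;
    std::vector<char*> row;
    row.reserve(50);                        // ~50 columns per row
    for (char* line = strtok_r(buf, "\n", &save_line); line;
         line = strtok_r(nullptr, "\n", &save_line)) {
        char* save_tok = nullptr;
        for (char* tok = strtok_r(line, " \t", &save_tok); tok;
             tok = strtok_r(nullptr, " \t", &save_tok))
            row.push_back(tok);
        update_statistics(stats, row);
        row.clear();                        // reuse the vector for the next row
    }

    // 4) Done with this file.
    std::free(buf);
}
```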

I am looking for some upgrades, as I feel it is still not as fast as I would like. The files I read are not too big - something like 60-100k rows per file with 50 columns - and I have around 5k of them. When I monitor CPU usage with htop I clearly see some bottlenecks: at the beginning of the run each core sits around 30% usage, then it slowly climbs over the runtime and stabilizes around 65-90%, with regular noticeable drops (which probably correspond to the moments when new files are being read)...

I am already reserving memory for the vectors, and for the char* <-> float conversions I am already using fast ftoa and atof implementations. I'm not sure whether my main bottleneck, especially at the beginning, is the disk (I'm on an HDD, not an SSD) or the RAM itself. I was thinking about mmap'ing each file instead of fread'ing it - could that help? Give me any good ideas for improvements, plox!
(I have also tried OpenMP - the compiler I'm using is GCC 5.4, I believe.) Gieb some nice ideas plx :x
Member
Posts: 3,939
Joined: Feb 1 2013
Gold: 2,749.09
Warn: 20%
Oct 10 2016 12:34pm
why
Member
Posts: 5,567
Joined: Apr 30 2006
Gold: 1,020.00
Oct 10 2016 12:50pm
Quote (boxboxbox @ Oct 10 2016 06:34pm)
why


For speed ofc. Doesn't matter why tho - pls, I'm not looking for that kind of answer ;)
Member
Posts: 3,939
Joined: Feb 1 2013
Gold: 2,749.09
Warn: 20%
Oct 10 2016 04:30pm
but why
Member
Posts: 32,925
Joined: Jul 23 2006
Gold: 3,804.50
Oct 10 2016 06:47pm
have you run a profiler to see duration by function? IMO it's not worth guessing at optimizations until you know what's taking up the time. eg if you find the data structures are taking up the time, you can try optimizing them; but i wouldn't drop all the maps until you know it's the bottleneck.

can you pre-process files?

can you set up a ram disk? if you can't use it for your main use case, you can at least use it to see if i/o is the problem.

Quote
where every task looks like this:
1) Open a file with statistics, malloc a memory chunk the size of the file, fread the file into it, then close the file.


you didn't mention which part of the app is slower, the i/o or the processing. you can open your files ahead of time so your other threads aren't waiting on i/o. e.g. if you have 5 threads processing data at a given time, you can have 6 files already read into memory. as soon as one thread is done, the next data will already be ready instead of the thread taking a break to read from disk. (rough sketch below.)
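something along these lines - a minimal read-ahead sketch, assuming one reader thread feeding a bounded queue of whole-file contents (all names made up for illustration):

```cpp
// Read-ahead sketch: one reader thread keeps a bounded queue of file
// contents filled, so worker threads rarely block on disk.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

class BoundedQueue {
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}

    void push(std::string item) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_ || done_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Returns false once the queue is drained and no more files are coming.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return true;
    }

    void close() {                       // reader calls this when finished
        std::lock_guard<std::mutex> lk(m_);
        done_ = true;
        not_empty_.notify_all();
        not_full_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::string> q_;
    size_t cap_;
    bool done_ = false;
};

int main() {
    std::vector<std::string> paths = {/* ... ~5k file paths ... */};
    BoundedQueue queue(8);               // stay a few files ahead of the workers

    std::thread reader([&] {
        for (const auto& p : paths) {    // sequential reads are HDD-friendly
            std::ifstream f(p, std::ios::binary);
            std::ostringstream ss;
            ss << f.rdbuf();             // slurp the whole file
            queue.push(ss.str());
        }
        queue.close();
    });

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)          // worker count: whatever fits your box
        workers.emplace_back([&] {
            std::string contents;
            while (queue.pop(contents)) {
                // parse + update stats here
            }
        });

    reader.join();
    for (auto& w : workers) w.join();
}
```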

This post was edited by carteblanche on Oct 10 2016 06:59pm
Member
Posts: 5,567
Joined: Apr 30 2006
Gold: 1,020.00
Oct 10 2016 11:15pm
Quote (carteblanche @ Oct 11 2016 12:47am)
have you run a profiler to see duration by function? IMO it's not worth guessing at optimizations until you know what's taking up the time. eg if you find the data structures are taking up the time, you can try optimizing them; but i wouldn't drop all the maps until you know it's the bottleneck.

can you pre-process files?

can you set up a ram disk? if you can't use it for your main use case, you can at least use it to see if i/o is the problem.



you didn't mention which part of the app is slower, the i/o or the processing. you can open your files ahead of time so your other threads aren't waiting on i/o. e.g. if you have 5 threads processing data at a given time, you can have 6 files already read into memory. as soon as one thread is done, the next data will already be ready instead of the thread taking a break to read from disk.


Ye, I was thinking of using a ring buffer for this problem..
one of the bigger issues seems to be allocating memory at the beginning of the run..
I will do some profiling with valgrind today and see if I can figure something out
Member
Posts: 6,953
Joined: Sep 27 2003
Gold: 518.50
Oct 13 2016 05:27pm
Suggestions:

1) Don't copy the file into a userspace buffer. Use `mmap` and `madvise` with `MADV_SEQUENTIAL` to read it directly (first sketch below).

2) `strtok_r` is slower than you would expect. If you know the delimiters, it is faster to use `std::find_if`. The reason is that `delim` is a `const char*`, so every character gets checked against the whole delimiter string - extra lookups compared to just writing `c == ' ' || c == '\n'` (second sketch below).

3) Unless you need your data up-to-date in real time, don't bother with the shared map. Give each thread its own map and then combine them at the end. This is called "map-reduce" (third sketch below).
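For (1), a minimal sketch, assuming Linux and read-only access (error handling trimmed):

```cpp
// Sketch of suggestion (1): map the file instead of malloc+fread.
// Assumes Linux/POSIX and read-only access; error handling trimmed.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char* map_file(const char* path, size_t& size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    fstat(fd, &st);
    size_out = static_cast<size_t>(st.st_size);

    void* p = mmap(nullptr, size_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    // Tell the kernel we'll scan front-to-back so it reads ahead aggressively.
    madvise(p, size_out, MADV_SEQUENTIAL);
    return static_cast<const char*>(p);
}

// ... parse the mapped bytes, then: munmap(const_cast<char*>(ptr), size);
```

Note the mapping is read-only, so an in-place tokenizer like `strtok_r` (which writes NUL bytes) can't run on it - which is where (2) comes in.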
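For (2), a sketch of a `std::find_if` splitter that returns pointer ranges instead of NUL-terminated strings, so it works fine on the read-only mapping:

```cpp
// Sketch of suggestion (2): split on known delimiters with std::find_if
// instead of strtok_r. The delimiter test is a plain inlined comparison,
// not a scan over a delimiter string, and nothing is written to the buffer.
#include <algorithm>
#include <utility>
#include <vector>

inline bool is_delim(char c) { return c == ' ' || c == '\n'; }

// Returns (begin, end) pointers for each token in [first, last).
std::vector<std::pair<const char*, const char*>>
tokenize(const char* first, const char* last) {
    std::vector<std::pair<const char*, const char*>> tokens;
    while (first != last) {
        first = std::find_if(first, last,
                             [](char c) { return !is_delim(c); });
        if (first == last) break;
        const char* end = std::find_if(first, last, is_delim);
        tokens.emplace_back(first, end);
        first = end;
    }
    return tokens;
}
```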
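For (3), the shape of the idea - assuming, for illustration, that merging two values for a key just means concatenating their float vectors:

```cpp
// Sketch of suggestion (3): each thread fills a private map with no locking,
// then the maps are merged once at the end. Assumes merging = appending the
// float vectors; substitute your real merge logic.
#include <string>
#include <unordered_map>
#include <vector>

using LocalMap = std::unordered_map<std::string, std::vector<float>>;

LocalMap merge(std::vector<LocalMap>& per_thread) {
    LocalMap result;
    for (auto& local : per_thread) {          // one map per worker thread
        for (auto& kv : local) {
            auto& dst = result[kv.first];     // creates the key if missing
            dst.insert(dst.end(), kv.second.begin(), kv.second.end());
        }
        local.clear();                        // free memory as we go
    }
    return result;
}
```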