
Topic: best way to process huge data files in C (Read 1802 times)


Offline Admin (Topic starter)

  • Administrator
  • Supreme Robot
  • *****
  • Posts: 11,632
  • Helpful? 169
    • Society of Robots
best way to process huge data files in C
« on: August 06, 2009, 10:12:05 AM »
I have these ginormous data files that I currently process by hand in Excel. I'm trying to automate the data processing by writing a program in C (for a PC).

The issue is that although I probably have the skills to do it, I'm looking for expert programmer advice on the best way to do it.

A data file looks something like the sample below. The full data files are attached. I can use any file format, such as txt.
Code: [Select]
System Warming Up...........Initialization Complete

9 0 112 176 0 0 0 0 0


speed=1

max_bulk=20

155 0 113 177 0 0 0 0 0
286 0 112 179 3 3 3 3 78
279 0 112 178 0 0 0 0 0
270 0 112 176 2 2 2 3 154
271 0 113 176 3 3 3 3 95
273 0 114 176 1 1 1 1 55
279 0 115 178 1 1 1 1 7
286 0 113 177 0 0 0 0 0
293 0 110 179 3 3 3 3 21
298 0 109 178 1 1 1 1 86
437 0 109 180 1 1 1 1 72
288 0 112 179 1 1 1 1 98
279 0 115 176 0 0 0 0 0
271 0 117 174 3 3 3 3 134
271 0 117 176 2 2 2 2 87
272 0 116 177 1 1 1 1 62

speed=3

max_bulk=60

155 0 113 177 0 0 0 0 0
283 0 112 177 2 2 1 1 1
283 0 112 177 1 1 1 1 3
286 0 112 176 0 0 0 0 2
274 0 112 176 2 2 3 3 131
277 0 113 177 0 0 0 0 0
278 0 113 176 0 0 0 0 0
267 0 112 177 3 3 3 3 96
269 0 113 177 1 1 1 1 47
272 0 113 176 0 0 0 0 62
266 0 112 177 2 2 2 2 72
266 0 111 177 0 1 1 1 34
269 0 112 178 0 0 0 0 29
268 0 112 177 1 1 1 1 52
269 0 112 178 0 0 0 0 27
272 0 113 178 0 0 0 0 0
274 0 113 176 2 2 2 2 12
275 0 112 177 0 0 0 0 0
278 0 111 179 0 0 0 0 3
283 0 111 178 1 1 1 1 1
283 0 111 178 1 2 1 1 8
286 0 111 177 0 0 0 0 1
289 0 112 175 3 3 3 3 93
291 0 112 177 1 1 1 1 44
295 0 112 177 0 0 0 0 0
292 0 112 177 2 2 2 2 67
293 0 112 178 0 0 0 0 28
298 0 112 177 0 0 0 0 0
289 0 112 176 1 2 2 2 61
291 0 112 178 0 0 0 0 1
436 0 113 178 0 0 0 0 11
284 0 114 176 2 1 1 1 2
283 0 112 176 1 1 0 0 2
286 0 112 176 0 0 0 0 1
274 0 111 177 2 2 2 2 130
277 0 111 177 1 1 1 1 1
278 0 112 177 0 0 0 0 0
267 0 110 177 3 3 3 3 97
268 0 112 176 1 1 1 1 45
273 0 113 176 0 0 0 0 57
266 0 112 177 2 2 2 2 67


Basically, I need to do this:
- find average for each column, per speed, per bulk, in both data files
- multiply each averaged column value by another value
- subtract column value in Water datafile by equivalent column value in Air datafile
- output the result in a text file, per speed and per bulk value


What's the best way to do this? My first thought was to create a huge matrix, but considering the file size, I'm not sure about memory limitations . . .
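
On the memory worry: a huge matrix isn't strictly needed just to get averages. Here is a minimal sketch of the alternative, assuming the layout in the sample above (a speed= line, then a max_bulk= line, then rows of 9 integers). It reads the input once and keeps only a running sum and a row count per speed/bulk block. It is in Java to match the code posted later in the thread. The names BlockAverager and averages are made up for illustration, and ignoring any rows that appear before the first speed= header is an assumption, not something stated in the post.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper, not code from the thread: streams the input once and
// stores one row of running sums per speed/max_bulk block.
class BlockAverager {
    static final int COLUMNS = 9; // the raw rows in the sample have 9 values

    // Returns block label -> per-column averages, in the order blocks appear.
    static Map<String, double[]> averages(Scanner in) {
        Map<String, double[]> sums = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        String speed = null;
        String key = null; // stays null until a speed= and max_bulk= pair is seen
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.startsWith("speed=")) {
                speed = line.substring("speed=".length());
                key = null; // wait for the matching max_bulk= line
            } else if (line.startsWith("max_bulk=") && speed != null) {
                key = "speed=" + speed + " " + line;
                sums.putIfAbsent(key, new double[COLUMNS]);
                counts.putIfAbsent(key, 0);
            } else if (key != null) {
                String[] tokens = line.split("\\s+");
                if (tokens.length != COLUMNS) continue; // blank or unexpected line
                int[] values = new int[COLUMNS];
                try {
                    for (int i = 0; i < COLUMNS; i++) values[i] = Integer.parseInt(tokens[i]);
                } catch (NumberFormatException e) {
                    continue; // not a data row
                }
                double[] row = sums.get(key);
                for (int i = 0; i < COLUMNS; i++) row[i] += values[i];
                counts.merge(key, 1, Integer::sum);
            }
        }
        // Turn the sums into averages; a block with no rows averages to 0.
        Map<String, double[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            int n = counts.get(e.getKey());
            double[] avg = new double[COLUMNS];
            for (int i = 0; i < COLUMNS; i++) avg[i] = (n == 0) ? 0.0 : e.getValue()[i] / n;
            result.put(e.getKey(), avg);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two rows copied from the sample block above, used as inline test input.
        String sample = "speed=1\n\nmax_bulk=20\n\n"
                + "155 0 113 177 0 0 0 0 0\n"
                + "286 0 112 179 3 3 3 3 78\n";
        averages(new Scanner(sample))
                .forEach((k, v) -> System.out.println(k + " -> " + Arrays.toString(v)));
    }
}
```

For the real files, a Scanner opened on the data file could be passed in instead of the sample string. Memory use then grows with the number of speed/bulk blocks rather than the number of rows.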

Offline wil.hamilton

  • Robot Overlord
  • ****
  • Posts: 207
  • Helpful? 6
  • rtfm
Re: best way to process huge data files in C
« Reply #1 on: August 06, 2009, 10:51:00 AM »
Does this have to be done in C?

There are much easier ways to do this if you don't do it in C. Assuming you are on Windows, it's easy to write something in VBScript to do this.
If you know Java and C and have seen Visual Basic, you could figure it out in no time.

Edit: my job is writing code to do stuff with Excel files. (I use C#, but you really need Visual Studio to write C#. I've also written things to manipulate Excel files using VBScript.)
« Last Edit: August 06, 2009, 10:52:37 AM by wil.hamilton »
use the google.  it's your friend.

Offline GearMotion

  • Supreme Robot
  • *****
  • Posts: 489
  • Helpful? 24
  • Two decades+ of Embedded Design
Re: best way to process huge data files in C
« Reply #2 on: August 06, 2009, 11:02:36 AM »
This would be perfect to write in AWK (gawk), a free command-line utility language. I used to do this all the time.

Offline Razor Concepts

  • Supreme Robot
  • *****
  • Posts: 1,856
  • Helpful? 53
Re: best way to process huge data files in C
« Reply #3 on: August 06, 2009, 11:13:30 AM »
I wrote a program in Java; it did the entire water file in 312 milliseconds.

Code: [Select]
   import java.io.*;
   import java.util.*;

   public class Process
   {
      public static void main(String[] args) throws Exception
      {
         int count = 0;              // lines seen since the last block header
         int[] data = new int[9];    // running sum of each column in the current block
         int pos = 0;                // column index within the current row
         int speed = 2;              // labels used for the first block printed
         int bulk = 20;
         long time = System.currentTimeMillis();
         Scanner s = new Scanner(new File("data.txt"));
         while(s.hasNextLine())
         {
            count++;
            String line = s.nextLine();
            if(line.length() == 0)   // stop at the first blank line
               break;
            if(line.substring(0,1).compareTo("s") == 0)   // a "speed=N" line starts a new block
            {
               // print the integer averages of the block that just ended
               System.out.println("Speed: " + speed + " Bulk: " + bulk);
               for(int i = 0; i < 9; i++)
               {
                  System.out.print(data[i]/count + "\t");
               }
               System.out.println();
               System.out.println();
               count = 0;
               data = new int[9];
               speed = Integer.parseInt(line.substring(6,7));   // single-digit speed only
               if(speed == 0)        // "speed=0" acts as an end-of-data marker
               {
                  System.out.println();
                  System.out.println(System.currentTimeMillis() - time);   // elapsed ms
                  return;
               }
               String b = s.nextLine();                       // the "max_bulk=NN" line
               bulk = Integer.parseInt(b.substring(9,11));    // two-digit bulk only
               line = s.nextLine();                           // first data row of the new block
            }
            // add each value on the row to its column total
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens())
            {
               int i = Integer.parseInt(st.nextToken());
               data[pos] += i;
               pos++;
            }
            pos = 0;
         }
      }
   }

The output:
Code: [Select]
----jGRASP exec: java Process

Speed: 2 Bulk: 20
287 114 172 0 0 0 0 33 0

Speed: 3 Bulk: 20
284 114 171 0 0 0 0 25 0

Speed: 4 Bulk: 20
282 114 171 0 0 0 0 23 0

Speed: 5 Bulk: 20
281 114 171 0 0 0 0 15 0

Speed: 2 Bulk: 40
287 114 180 0 0 0 0 41 0

Speed: 3 Bulk: 40
284 114 176 0 0 0 0 39 0

Speed: 4 Bulk: 40
283 114 175 0 0 0 0 36 0

Speed: 5 Bulk: 40
282 114 176 0 0 0 0 28 0

Speed: 2 Bulk: 60
287 115 195 0 0 0 0 34 0

Speed: 3 Bulk: 60
284 114 179 0 0 0 0 39 0

Speed: 4 Bulk: 60
283 113 177 0 0 0 0 37 0

Speed: 5 Bulk: 60
282 114 177 0 0 0 0 30 0

Speed: 2 Bulk: 80
287 115 200 0 0 0 0 32 0

Speed: 3 Bulk: 80
284 113 184 0 0 0 0 31 0

Speed: 4 Bulk: 80
283 113 179 0 0 0 0 34 0

Speed: 5 Bulk: 80
282 114 179 0 0 0 0 35 0


312

It has some little bugs (I accidentally added another data point, 9 points instead of 8... oh well), but for small data sizes like 11000 points it doesn't really take much time  :)

Offline Admin (Topic starter)

  • Administrator
  • Supreme Robot
  • *****
  • Posts: 11,632
  • Helpful? 169
    • Society of Robots
Re: best way to process huge data files in C
« Reply #4 on: August 06, 2009, 12:00:50 PM »
Whoa thanks Razor! I'm glad you guys understood the question, wasn't sure if I explained it properly.

Ok two questions . . . after finding the averages but before printing out the results,

how do I do this step?
- multiply each averaged column value by another value

then this step?
- subtract column value in Water datafile by equivalent column value in Air datafile


(ps - I suck at Java :P)

Offline wil.hamilton

  • Robot Overlord
  • ****
  • Posts: 207
  • Helpful? 6
  • rtfm
Re: best way to process huge data files in C
« Reply #5 on: August 06, 2009, 01:48:57 PM »
how do I do this step?
- multiply each averaged column value by another value

then this step?
- subtract column value in Water datafile by equivalent column value in Air datafile

The program would probably have to be modified to store the averages in a 2D array and then multiply, or it could just multiply each value right before printing it.

To handle both files,
it would have to read one file and store its averages in a 2D array, then read the other file and do the subtraction against that 2D array as it processes each block.
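
A minimal sketch of those two steps, assuming the averages are already in a 2D array indexed as [block][column], with blocks in the same order in both files. The class name, method names, and the numbers in main are made up for illustration. The water-minus-air direction is also an assumption taken from the wording of the question.

```java
import java.util.Arrays;

// Hypothetical sketch, not code from the thread.
class WaterMinusAir {
    // Multiplies each column average by that column's multiplier.
    static double[][] scale(double[][] averages, double[] multipliers) {
        double[][] out = new double[averages.length][];
        for (int b = 0; b < averages.length; b++) {
            out[b] = new double[averages[b].length];
            for (int c = 0; c < averages[b].length; c++) {
                out[b][c] = averages[b][c] * multipliers[c];
            }
        }
        return out;
    }

    // Subtracts air from water, block by block and column by column.
    // Assumes block i in one array is the same speed/bulk as block i in the other.
    static double[][] subtract(double[][] water, double[][] air) {
        if (water.length != air.length) {
            throw new IllegalArgumentException("block counts differ");
        }
        double[][] out = new double[water.length][];
        for (int b = 0; b < water.length; b++) {
            if (water[b].length != air[b].length) {
                throw new IllegalArgumentException("column counts differ in block " + b);
            }
            out[b] = new double[water[b].length];
            for (int c = 0; c < water[b].length; c++) {
                out[b][c] = water[b][c] - air[b][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up averages for two blocks of three columns each.
        double[][] water = {{280.0, 112.0, 177.0}, {284.0, 113.0, 176.0}};
        double[][] air = {{279.0, 112.0, 180.0}, {283.0, 114.0, 175.0}};
        double[] multipliers = {1.0, 0.5, 2.0}; // one multiplier per column
        double[][] result = subtract(scale(water, multipliers), scale(air, multipliers));
        for (double[] block : result) {
            System.out.println(Arrays.toString(block));
        }
    }
}
```

Only the averages live in the arrays, so the storage is one row per speed/bulk block no matter how long the raw files are.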

(ps - I suck at Java :P)

didn't you have to take 15-100?

(i'm a current cmu student, we've met)

Offline Admin (Topic starter)

  • Administrator
  • Supreme Robot
  • *****
  • Posts: 11,632
  • Helpful? 169
    • Society of Robots
Re: best way to process huge data files in C
« Reply #6 on: August 06, 2009, 05:05:21 PM »
didn't you have to take 15-100?

(i'm a current cmu student, we've met)
When I took 15-100 to 15-whatever, it was all taught using C++. They moved to Java the year after my CS classes. Probably a good thing, as C++ is more similar to C, which is great for microcontrollers. ;D

I had to use Java once for a class, but don't really remember any of it.

I graduated in 2004 . . . met where?

Offline awally88

  • Robot Overlord
  • ****
  • Posts: 212
  • Helpful? 0
Re: best way to process huge data files in C
« Reply #7 on: August 09, 2009, 06:34:50 PM »
Tokenizer? I prefer Scanner for processing data, but Tokenizer is OK. If you create

Scanner cons = new Scanner(System.in);

you can just redirect the data file into the program (java filename < datafile).


Then to import data you can use
lineoftext = cons.nextLine();
Scanner line = new Scanner(lineoftext);
line.nextInt() or line.next(); <---- Means you don't have to keep converting everything yourself!
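
A short sketch of that Scanner-per-line idea, under the assumption that data rows are whitespace-separated integers and anything else (such as a speed= line) should come back as an empty row. The names LineScannerDemo and parseRow are made up for illustration. The demo feeds a sample string so it runs on its own; swapping in new Scanner(System.in) gives the redirect version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch, not code from the thread.
class LineScannerDemo {
    // Reads the leading run of integers on one line; stops at the first
    // token that is not an integer.
    static int[] parseRow(String lineOfText) {
        List<Integer> values = new ArrayList<>();
        try (Scanner line = new Scanner(lineOfText)) {
            while (line.hasNextInt()) {
                values.add(line.nextInt());
            }
        }
        int[] row = new int[values.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = values.get(i);
        }
        return row;
    }

    public static void main(String[] args) {
        // Sample lines taken from the data shown in the first post.
        // Use new Scanner(System.in) here to run as: java LineScannerDemo < datafile
        Scanner cons = new Scanner("speed=1\n155 0 113 177 0 0 0 0 0\n");
        while (cons.hasNextLine()) {
            int[] row = parseRow(cons.nextLine());
            System.out.println(row.length + " values: " + Arrays.toString(row));
        }
    }
}
```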

Offline Samuel

  • Jr. Member
  • **
  • Posts: 18
  • Helpful? 0
Re: best way to process huge data files in C
« Reply #8 on: August 10, 2009, 06:42:50 AM »
Hi Admin,

I am enclosing a zip file that has in it a RobotBASIC program that (I think) achieves your requirements.

Note: You can specify the multipliers for each column for each file within the program code at the top of the program. See comments.

Note: you can specify the names of the data input files and the results file within the program...see the comments at the top.

Note: Your Excel sheet had one fewer column than the raw data. My program keeps every column in the raw data, so it has 9 columns where your Excel sheet has 8. The output results file therefore has 9 columns, even though the second column seems to always be 0 anyway.

Note: since there are 9 columns you have to specify 9 multipliers.

Note: The output file also has a date and time stamp

Note: The program assumes that for each speed-bulk block there is a corresponding one in the other file in the same relative position to the rest of the data. So if speed 2 bulk 20 is the first lot and speed 5 bulk 40 is the second lot in the air file then the same has to be in the water file.
The output file has the values for each speed-bulk values from each file. This way you can compare them to see if there is an error.
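
If the same-relative-position assumption is a concern, one alternative is to pair blocks by their speed/bulk label instead of by position. This is a minimal Java sketch of that idea, not how the RobotBASIC program works. The class and method names and the label format are made up for illustration, and leaving out blocks that are missing from the air data is an assumed policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pairs water and air blocks by label, not by file order.
class KeyedDifference {
    // Returns water minus air for every label present in both maps,
    // in the order the labels appear in the water map.
    static Map<String, double[]> subtractByKey(Map<String, double[]> water,
                                               Map<String, double[]> air) {
        Map<String, double[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : water.entrySet()) {
            double[] a = air.get(e.getKey());
            if (a == null) {
                continue; // no matching air block: skip rather than mis-pair
            }
            double[] w = e.getValue();
            double[] diff = new double[Math.min(w.length, a.length)];
            for (int c = 0; c < diff.length; c++) {
                diff[c] = w[c] - a[c];
            }
            out.put(e.getKey(), diff);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up averages; the two maps list their blocks in different orders.
        Map<String, double[]> water = new LinkedHashMap<>();
        water.put("speed=2 max_bulk=20", new double[]{287.0, 114.0});
        water.put("speed=5 max_bulk=40", new double[]{282.0, 114.0});
        Map<String, double[]> air = new LinkedHashMap<>();
        air.put("speed=5 max_bulk=40", new double[]{281.0, 113.0});
        air.put("speed=2 max_bulk=20", new double[]{286.0, 115.0});
        subtractByKey(water, air).forEach((label, diff) ->
                System.out.println(label + " -> " + diff[0] + " " + diff[1]));
    }
}
```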

The program is commented and you should be able to easily change it.

Here is the resultant output from running the program on the data files you posted. It took 78.5 seconds to complete and I used 1 for all multipliers.
Code: [Select]
Results for "data_file_water.txt" and "data_file_air.txt
Processed on Mon. August 10,2009 08:10:27 AM

Water:speed=2  max_bulk=20         Air:speed=2  max_bulk=20
0.03    0.00    0.17    -4.49   -0.02   -0.02   -0.04   0.00    2.20    
Water:speed=3  max_bulk=20          Air:speed=3  max_bulk=20
-0.06   0.00    0.12    -5.80   0.00    0.00    -0.01   -0.04   -3.82  
Water:speed=4  max_bulk=20          Air:speed=4  max_bulk=20
0.01    0.00    0.09    -5.65   -0.01   -0.01   -0.01   -0.01   0.53    
Water:speed=5  max_bulk=20          Air:speed=5  max_bulk=20
-0.06   0.00    0.08    -5.75   0.01    0.01    0.01    0.01    -2.93  
Water:speed=2  max_bulk=40          Air:speed=2  max_bulk=40
-0.33   0.00    0.78    3.29    0.04    0.02    0.02    0.05    6.03    
Water:speed=3  max_bulk=40          Air:speed=3  max_bulk=40
0.02    0.00    0.01    -1.09   -0.02   -0.02   -0.01   0.00    1.31    
Water:speed=4  max_bulk=40          Air:speed=4  max_bulk=40
0.09    0.00    -0.01   -1.40   -0.03   -0.02   -0.02   0.00    3.88    
Water:speed=5  max_bulk=40          Air:speed=5  max_bulk=40
0.05    0.00    0.10    -1.25   0.01    0.01    0.02    0.05    1.49    
Water:speed=2  max_bulk=60          Air:speed=2  max_bulk=60
-0.13   0.00    1.67    17.89   -0.08   -0.07   -0.06   -0.03   5.51    
Water:speed=3  max_bulk=60          Air:speed=3  max_bulk=60
-0.32   0.00    0.02    2.52    -0.06   -0.06   -0.05   -0.03   2.33    
Water:speed=4  max_bulk=60          Air:speed=4  max_bulk=60
0.01    0.00    -0.04   0.72    -0.01   -0.01   -0.01   -0.01   1.49    
Water:speed=5  max_bulk=60          Air:speed=5  max_bulk=60
-0.05   0.00    0.09    0.63    -0.03   -0.02   -0.01   -0.02   1.59    
Water:speed=2  max_bulk=80          Air:speed=2  max_bulk=80
0.21    0.00    1.79    23.34   0.11    0.11    0.12    0.12    9.24    
Water:speed=3  max_bulk=80          Air:speed=3  max_bulk=80
-0.14   0.00    0.02    6.81    0.00    -0.01   -0.03   -0.03   3.99    
Water:speed=4  max_bulk=80          Air:speed=4  max_bulk=80
-0.25   0.00    0.03    2.59    -0.02   -0.02   -0.02   0.00    1.01    
Water:speed=5  max_bulk=80          Air:speed=5  max_bulk=80
0.05    0.00    0.17    1.97    -0.01   -0.02   -0.02   -0.04   4.70    
« Last Edit: August 10, 2009, 06:48:46 AM by Samuel »

Offline wil.hamilton

  • Robot Overlord
  • ****
  • Posts: 207
  • Helpful? 6
  • rtfm
Re: best way to process huge data files in C
« Reply #9 on: August 11, 2009, 06:47:24 PM »
didn't you have to take 15-100?

(i'm a current cmu student, we've met)
When I took 15-100 to 15-whatever, it was all taught using C++. They moved to Java the year after my CS classes. Probably a good thing, as C++ is more similar to C, which is great for microcontrollers. ;D

I had to use Java once for a class, but don't really remember any of it.

I graduated in 2004 . . . met where?

mobot 2008 (14th), we were both competitors, we talked for a few minutes between heats, i'm still a student there

Offline Admin (Topic starter)

  • Administrator
  • Supreme Robot
  • *****
  • Posts: 11,632
  • Helpful? 169
    • Society of Robots
Re: best way to process huge data files in C
« Reply #10 on: August 19, 2009, 07:18:48 AM »
thanks guys

I put my coworker on it since he has Java experience, and the sample Java code definitely helped him out.

mobot 2008 (14th), we were both competitors, we talked for a few minutes between heats, i'm still a student there
which bot was yours?

Offline SmAsH

  • Supreme Robot
  • *****
  • Posts: 3,959
  • Helpful? 75
  • SoR's Locale Electronics Nut.
Re: best way to process huge data files in C
« Reply #11 on: August 19, 2009, 04:11:36 PM »
mobot 2008 (14th), we were both competitors, we talked for a few minutes between heats, i'm still a student there

which bot was yours?
http://www.societyofrobots.com/robotforum/index.php?topic=8651.msg67469#msg67469
Howdy

Offline sonictj

  • Supreme Robot
  • *****
  • Posts: 416
  • Helpful? 11
Re: best way to process huge data files in C
« Reply #12 on: August 19, 2009, 04:38:49 PM »
@ smash
The video you linked to is mobot 2009.  The competition spoken of was 2008.

Offline SmAsH

  • Supreme Robot
  • *****
  • Posts: 3,959
  • Helpful? 75
  • SoR's Locale Electronics Nut.
Re: best way to process huge data files in C
« Reply #13 on: August 20, 2009, 12:21:25 AM »
damn, my bad...

 

