
Importing large data sets in Rails and Postgres

In one of my recent projects, I needed to import a lot of data from CSV files into a database as persisted domain objects in a web application. This file import functionality wasn't just for application bootstrapping or a one-time legacy data migration; it was a recurring task that allowed the web app to receive updates from a third-party vendor.

The number of records in each file could reach into the millions, so I knew from the outset that the plain-vanilla method of creating Rails ActiveRecord instances one at a time was probably not going to cut it.

I implemented the data import functionality according to two approaches. The first was to use the smarter_csv and activerecord_import gems to read records in batches from a CSV file and import those batches into the database in chunks. The advantage of this method is that the code is pure Ruby, so total portability is assured. The activerecord_import and smarter_csv gems are optimized for speed, so performance is also very good within the context of a framework like Rails. However, ActiveRecord carries so much meta-programming magic and so many included modules, not to mention validations and callbacks, that it adds a lot of extra CPU time to any large import task. We can do even better if we avoid it completely.

The second approach does just that. It uses the native Postgres COPY command and avoids ActiveRecord entirely. This approach sacrifices portability for speed. If I needed to swap out Postgres in favor of another RDBMS, I would have to adapt the code to whatever native COPY command that RDBMS provides, or fall back to the first approach. I don't consider portability to be a binary trait, so I tend to favor mostly portable solutions and make exceptions where breaking portability is warranted. If I ever need to transition to a new database vendor, having maintained a high degree of portability means very few code changes. So partial credit applies here: partial portability provides most of the code re-use benefit of total portability, while allowing for better performance when needed.

I benchmarked these two approaches using a scenario in which I need to import 1 million records from a single CSV file. For reference, I also implemented a basic approach of reading one entry at a time and saving it to the database, as a third benchmark comparison. I imported 1 million records using the two non-basic approaches. Due to its slow speed, I only imported 1,000 records using the basic approach and extrapolated the result to infer the time for 1 million records.
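For reference, the test file is just a tab-separated CSV with a header row and the seven columns used in the code snippets further down. A throwaway script along the following lines can generate one; the column values here are arbitrary placeholders, since the real schema and data aren't important for the timing comparison.

# Generates a tab-separated benchmark file of 1 million fake rows.
# The values are arbitrary placeholders; the header names simply match
# the columns array used in the import snippets below.
require "csv"

CSV.open('/tmp/fake_reporting_data.csv', 'w', col_sep: "\t") do |csv|
  csv << %w[name description cost price quantity sale size]
  1_000_000.times do |i|
    csv << ["Item #{i}", "Description for item #{i}", rand(1.0..100.0).round(2),
            rand(1.0..200.0).round(2), rand(1..50), [true, false].sample, %w[S M L].sample]
  end
end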

The running time results were as follows:

  • one at a time, 1,000 records; time to import: 13.29 seconds; inferred total time for 1 million records: 13,288 seconds, or 3.69 hours
  • in batches with activerecord_import and smarter_csv, 1 million records; total time: 391.37 seconds
  • via the COPY command, 1 million records; total time: 118.55 seconds

The results are pretty conclusive. The basic approach is roughly 34 times slower than the batch approach, and the batch approach is about 3 times slower than the native COPY command approach. And while these last two running times may seem similar compared to the first approach, a factor of 3 can make the difference between an application being able to run in real time and not.

In this case, breaking convention and using a high-performance custom approach is well worth the risk of limited future portability issues.

Code snippets follow:

Import via the Postgres COPY command:

 1|conn = ActiveRecord::Base.connection
 2|rc = conn.raw_connection
 3|columns = [:name, :description, :cost, :price, :quantity, :sale, :size]
 4|#rc.exec("COPY large_table (col1, col2, etc) FROM STDIN WITH CSV") #sample COPY usage
 5|started_time = Time.now
 6|rc.transaction do
 7|    rc.exec("COPY fake_data_points (#{columns.map(&:to_s).join(', ')}) FROM STDIN")

 8|    file = File.open('/tmp/fake_reporting_data.csv', 'r')
 9|    file.readline #skip the header row
10|    while !file.eof?
11|        rc.put_copy_data(file.readline)
12|    end
13|    rc.put_copy_end

14|    #show error messages
15|    while res = rc.get_result       
16|        print e_message if e_message = res.error_message 
17|    end

18|end
19|finished_time = Time.now
20|puts "time to import: #{finished_time-started_time}"

Line 7 sets the template for all actual data copy executions, which occur on line 11, one per line in the CSV file. Lines 14-17 run boilerplate Postgres-specific code for retrieving errors.
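As an aside, the pg gem also offers a copy_data helper that takes care of the put_copy_end call and the error-retrieval loop on lines 13-17. A rough sketch of the same import using it, with the same table and file as above, would look like this:

# Sketch: the same COPY import via PG::Connection#copy_data, which sends
# put_copy_end for us and raises a PG::Error if the server reports a problem.
columns = [:name, :description, :cost, :price, :quantity, :sale, :size]
rc = ActiveRecord::Base.connection.raw_connection

rc.transaction do
  rc.copy_data("COPY fake_data_points (#{columns.join(', ')}) FROM STDIN") do
    File.open('/tmp/fake_reporting_data.csv', 'r') do |file|
      file.readline # skip the header row
      rc.put_copy_data(file.readline) until file.eof?
    end
  end
end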

Import via one-instance-at-a-time creation:

 1|index = 0
 2|time_to_import = Benchmark.realtime do
 3|    CSV.foreach('/tmp/fake_reporting_data.csv', col_sep: "\t", headers: true) do |row|
 4|        break if index >= 1000 #benchmark 1000 records and extrapolate to a million
 5|        begin
 6|            hash = Hash[row.to_hash.map{|k,v|[k.underscore.to_sym,v]}]
 7|            FakeDataPoint.create!(hash)
 8|            index += 1
 9|        rescue StandardError #skip rows that fail to save
10|        end
11|    end
12|end
13|puts "time to import: #{time_to_import}"

The only thing worth pointing out here is that line 6 turns camel-cased CSV headers like StockQuantity into symbols for ActiveRecord fields in the FakeDataPoint model, like :stock_quantity.
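For example, a parsed row with camel-cased headers (hypothetical names here) would be transformed as follows; inside a Rails app, String#underscore comes from ActiveSupport:

# Illustration of the line 6 transformation; the header names are made up.
row = { "StockQuantity" => "42", "UnitPrice" => "9.99" }
hash = Hash[row.map { |k, v| [k.underscore.to_sym, v] }]
# hash is now { :stock_quantity => "42", :unit_price => "9.99" }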

Import via the activerecord_import and smarter_csv gems:

 1|require "csv"
 2|require "benchmark"
 3|columns = [:name, :description, :cost, :price, :quantity, :sale, :size]
 4|total_time = 0
 5|index = 0

 6|total_time = Benchmark.realtime do

 7|    SmarterCSV.process('/tmp/fake_reporting_data.csv', {:chunk_size => 10000, :col_sep => "\t"}) do |chunk|
 8|        transformed_chunk = chunk.map{|row|columns.map{|attr|row[attr]}}
 9|        FakeDataPoint.import(columns, transformed_chunk, validate: false)
10|    end

11|end
12|puts "total time to import: #{total_time}"

Here, I import in chunks of 10,000 lines. On line 8, I create an array with the values from a line of the CSV file, in the order in which their corresponding keys appear in the columns array. The activerecord_import import function that I call on line 9 expects an array of arrays, where the inner arrays contain record attributes in the order given by the function's first argument.
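In other words, the call on line 9 receives arguments shaped like this (the row values are placeholders for illustration):

# Shape of the data passed to activerecord_import's import on line 9;
# the values are made-up placeholders.
columns = [:name, :description, :cost, :price, :quantity, :sale, :size]
transformed_chunk = [
  ["Widget A", "first widget",  10.0, 19.99, 5, false, "M"],
  ["Widget B", "second widget", 12.5, 24.99, 3, true,  "L"]
]
FakeDataPoint.import(columns, transformed_chunk, validate: false)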

I've used the standard Ruby Benchmark library in the last two snippets. It cuts down on cruft by avoiding the explicit recording of start and end times and the subtraction between the two, at the cost of wrapping the code to benchmark in a Ruby block.
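For comparison, the two timing styles look like this, where do_import stands in for any of the import bodies above:

require "benchmark"

# do_import is a placeholder for the code being measured.
# Manual timing, as in the COPY snippet:
started_time = Time.now
do_import
puts "time to import: #{Time.now - started_time}"

# The same measurement with Benchmark.realtime, as in the other two snippets:
puts "time to import: #{Benchmark.realtime { do_import }}"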
