How to Read in a Column From Excel as One List Instead of Many Python

Introduction

With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately, Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read the data. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.

The Problem

The pandas read_excel function does an excellent job of reading Excel worksheets. However, in cases where the data is not a continuous table starting at cell A1, the results may not be what you expect.

If you try to read in this sample spreadsheet using read_excel(src_file):

Excel

You will get something that looks like this:

Excel

These results include a lot of Unnamed columns, header labels inside a row, as well as several extra columns we don't need.

Pandas Solutions

The simplest solution for this data set is to use the header and usecols arguments to read_excel(). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include.

If you would like to follow along with these examples, the file is on GitHub.

Here is one approach to read only the data we need.

    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'

    df = pd.read_excel(src_file, header=1, usecols='B:F')

The resulting DataFrame only contains the data we need. In this example, we purposely exclude the notes column and date field:

Clean DataFrame

The logic is relatively straightforward. usecols can accept Excel ranges such as B:F and read in only those columns. The header parameter expects a single integer that defines the header row. This value is 0-indexed, so we pass in 1 even though this is row 2 in Excel.
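To make the 0-based indexing concrete, here is a minimal sketch of how an Excel letter range lines up with 0-based column positions. The helper names column_letter_to_index and excel_range_to_indices are hypothetical and are not part of pandas; the sketch only handles a simple single range such as B:F.

```python
def column_letter_to_index(letter):
    """Convert an Excel column letter such as 'B' or 'AA' to a 0-based index."""
    index = 0
    for char in letter.upper():
        # Treat the letters as base-26 digits where A=1 ... Z=26
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def excel_range_to_indices(col_range):
    """Expand a simple range like 'B:F' into a list of 0-based column positions."""
    start, end = col_range.split(":")
    first = column_letter_to_index(start)
    last = column_letter_to_index(end)
    return list(range(first, last + 1))


positions = excel_range_to_indices("B:F")
print(positions)  # [1, 2, 3, 4, 5]
```

Column A is position 0, so B:F covers positions 1 through 5, in the same way that Excel row 2 corresponds to header=1.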

In some instances, we may want to define the columns as a list of numbers. In this case, we could define the list of integers:

    df = pd.read_excel(src_file, header=1, usecols=[1, 2, 3, 4, 5])

This approach might be useful if you have some sort of numerical pattern you want to follow for a large data set (i.e. every 3rd column or only even-numbered columns).
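As a rough illustration of such patterns, the lists can be generated rather than typed out. The sheet width n_columns below is an assumed value for the example, and "even" here means even 0-based positions; adjust both to match your own file.

```python
n_columns = 12  # hypothetical number of columns in the worksheet

# Every 3rd column, starting from the first one (0-based positions)
every_third = list(range(0, n_columns, 3))

# Only the columns at even 0-based positions
even_positions = [i for i in range(n_columns) if i % 2 == 0]

print(every_third)     # [0, 3, 6, 9]
print(even_positions)  # [0, 2, 4, 6, 8, 10]

# Either list could then be passed to read_excel, for example:
# df = pd.read_excel(src_file, header=1, usecols=every_third)
```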

The pandas usecols can also take a list of column names. This code will create an equivalent DataFrame:

    df = pd.read_excel(
        src_file,
        header=1,
        usecols=['item_type', 'order id', 'order date', 'country', 'priority'])

Using a list of named columns is going to be helpful if the column order changes but you know the names will not change.

Finally, usecols can take a callable function. Here's a simple long-form example that excludes unnamed columns as well as the priority column.

    # Define a more complex function:
    def column_check(x):
        if 'unnamed' in x.lower():
            return False
        if 'priority' in x.lower():
            return False
        if 'order' in x.lower():
            return True
        return True

    df = pd.read_excel(src_file, header=1, usecols=column_check)

The key concept to keep in mind is that the function will parse each column by name and must return True or False for each column. Those columns that evaluate to True will be included.
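One way to check a filter before reading a file is to run it over a list of header labels by hand. This is only a sketch: the headers list is made up for illustration, and the function is a condensed variant of the column check described above.

```python
def column_check(name):
    """Return True for columns to keep and False for columns to skip."""
    lowered = str(name).lower()
    if "unnamed" in lowered or "priority" in lowered:
        return False
    return True


# Hypothetical header labels, similar to what a messy sheet might produce
headers = ["Unnamed: 0", "item_type", "order id", "order date",
           "country", "priority", "Unnamed: 7"]

# Mimic the per-column decision: keep only names where the function is True
kept = [name for name in headers if column_check(name)]
print(kept)  # ['item_type', 'order id', 'order date', 'country']
```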

Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.

    cols_to_use = ['item_type', 'order id', 'order date', 'country', 'priority']

    df = pd.read_excel(
        src_file,
        header=1,
        usecols=lambda x: x.lower() in cols_to_use)

Callable functions give us a lot of flexibility for dealing with the real world messiness of Excel files.
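If the same kind of filter is needed in several places, one option is a small factory that builds the callable. The sketch below is an assumption-laden illustration: make_column_filter is a hypothetical helper, and it converts names with str() on the assumption that some header cells may not be strings.

```python
def make_column_filter(wanted):
    """Build a callable that keeps only the wanted columns (hypothetical helper).

    Names are compared after trimming whitespace and ignoring case.
    """
    normalized = {str(name).strip().casefold() for name in wanted}

    def keep(column_name):
        # str() guards against non-string header values such as numbers
        return str(column_name).strip().casefold() in normalized

    return keep


keep = make_column_filter(["item_type", "Order ID", "country"])

print(keep(" order id "))  # True
print(keep("priority"))    # False

# The callable could then be handed to read_excel, for example:
# df = pd.read_excel(src_file, header=1, usecols=keep)
```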

Ranges and Tables

In some cases, the data could be even more obfuscated in Excel. In this example, we have a table called ship_cost that we want to read. If you must work with a file like this, it might be challenging to read in with the pandas options we have discussed so far.

Excel table

In this case, we can use openpyxl directly to parse the file and convert the data into a pandas DataFrame. The fact that the data is in an Excel table can make this process a little easier.

Here's how to use openpyxl (once it is installed) to read the Excel file:

    from openpyxl import load_workbook
    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'

    wb = load_workbook(filename=src_file)

This loads the whole workbook. If we want to see all the sheets:

    wb.sheetnames
    ['sales', 'shipping_rates']

To access the specific sheet:

    sheet = wb['shipping_rates']

To see a list of all the named tables:

    sheet.tables.keys()
    dict_keys(['ship_cost'])

This key corresponds to the name we assigned in Excel to the table. Now we access the table to get the equivalent Excel range:

    lookup_table = sheet.tables['ship_cost']
    lookup_table.ref
    'C8:E16'

This worked. We now know the range of data we want to load. The final step is to convert that range to a pandas DataFrame. Here is a short code snippet to loop through each row and convert to a DataFrame:

    # Access the data in the table range
    data = sheet[lookup_table.ref]
    rows_list = []

    # Loop through each row and get the values in the cells
    for row in data:
        # Get a list of all columns in each row
        cols = []
        for col in row:
            cols.append(col.value)
        rows_list.append(cols)

    # Create a pandas dataframe from the rows_list.
    # The first row is the column names
    df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])

Here is the resulting DataFrame:

Excel shipping table

Now we have the clean table and can use it for further calculations.
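Once the table range is known, it can also be translated back into read_excel arguments. The following is a sketch under stated assumptions rather than a definitive approach: table_ref_to_read_excel_kwargs is a hypothetical helper, it assumes the first row of the range is the header row, and it only handles plain references such as C8:E16 (no sheet names or $ signs).

```python
import re


def table_ref_to_read_excel_kwargs(ref):
    """Translate a range such as 'C8:E16' into read_excel keyword arguments.

    Hypothetical helper: assumes the first row of the range holds the headers.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref.upper())
    if match is None:
        raise ValueError(f"Unsupported range: {ref!r}")
    first_col, first_row, last_col, last_row = match.groups()
    first_row, last_row = int(first_row), int(last_row)
    return {
        "header": first_row - 1,               # header is 0-indexed
        "usecols": f"{first_col}:{last_col}",  # Excel-style column range
        "nrows": last_row - first_row,         # data rows below the header
    }


kwargs = table_ref_to_read_excel_kwargs("C8:E16")
print(kwargs)  # {'header': 7, 'usecols': 'C:E', 'nrows': 8}

# These could then be passed along, for example:
# df = pd.read_excel(src_file, sheet_name='shipping_rates', **kwargs)
```

Whether this is preferable to looping over the cells with openpyxl depends on the file; the openpyxl route is still needed to discover the table range in the first place.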

Summary

In an ideal world, the data we use would be in a simple consistent format. See this paper for a nice discussion of what good spreadsheet practices look like.

In the examples in this article, you could easily delete rows and columns to make this more well-formatted. However, there are times where this is not feasible or advisable. The good news is that pandas and openpyxl give us all the tools we need to read Excel data - no matter how crazy the spreadsheet gets.

Changes

  • 21-Oct-2020: Clarified that we don't want to include the notes column

Source: https://pbpython.com/pandas-excel-range.html
