A Virtual Museum on the State's Fish Biodiversity

Edits to Dates, Collector Names and Determiner Names

Collection dates, collector names and determiner names were edited in a stepwise fashion, as outlined in detail below. Two collection dates, a begin date and an end date, were assigned to all records. Records with null dates were assigned conservative begin and end dates based on other information known about the collectors associated with the record. In some instances dates were recognized to be incorrect and were edited. Collector and determiner names were edited based largely on information contained in other records in the database.

Step 1: Parsing & Basic Cleaning of Collector Names

The verbatim collector names as received from our donors were typically held as a string in a single field, with multiple collectors' names in diverse and inconsistent formats. Names were often misspelled and frequently incomplete, and multiple names were inconsistently separated by various punctuation marks or not separated at all. In Microsoft Excel, a copy of the original verbatim collector name strings was parsed into individual collector names in separate columns (using a mix of automated formulae and manual methods), and each name was assigned a number corresponding to its position in the original text string. Individual collector names were then further parsed and re-ordered so that all followed the order last name, first and middle name, prefix. The maximum number of collectors for any record was 8, so 8 new sets of name-related fields were constructed for each record. Most records contain far fewer than 8 collectors, and thus many records have blank values for all but the first few collector fields.
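
The parsing described above was done in Excel with a mix of formulae and manual work. Purely as an illustration, a minimal Python sketch of the same idea might look like the following. The separator set and the function names are assumptions made for this example, not the rules actually applied to the donor data.

```python
import re

# Hypothetical separator set; the real donor strings also needed manual review.
SEPARATORS = re.compile(r"\s*(?:;|&|/|\band\b)\s*", flags=re.IGNORECASE)
MAX_COLLECTORS = 8  # maximum number of collectors per record given in the text


def split_collectors(verbatim):
    """Split a verbatim collector string into (position, name) pairs."""
    parts = [part.strip(" ,") for part in SEPARATORS.split(verbatim)]
    names = [part for part in parts if part]
    return list(enumerate(names, start=1))


def to_columns(verbatim):
    """Pad the parsed names out to the fixed set of 8 collector columns."""
    names = [name for _, name in split_collectors(verbatim)]
    return names[:MAX_COLLECTORS] + [""] * (MAX_COLLECTORS - len(names))
```

Commas are deliberately not treated as separators here, because a comma may also divide a last name from a first name within a single collector.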

Individual names were then compiled into a single column to facilitate sorting and synonymization of names. During this process names were corrected for spelling and partial names were replaced by full names when possible, but we were sometimes limited by our lack of familiarity with many of the collectors. The sometimes-used 'et al' in original strings was dropped from the collector string, but 'family', university class names, and other group identities were maintained as distinct collectors. After the diverse permutations of collector names were synonymized they were brought back together to their original positions (multiple collectors per record) for the subsequent steps.
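
As a rough sketch of the synonymization idea, variant spellings can be mapped to a single standardized name through a lookup table compiled during review. The names below are invented placeholders, and the lookup table and `standardize` function are hypothetical; group identities such as 'family' pass through unchanged, as described above.

```python
import re

# Hypothetical lookup compiled by hand while reviewing the sorted name list.
# Keys are lower-cased variants; values are the standardized form.
SYNONYMS = {
    "doe, j.": "Doe, Jane",
    "doe, jane": "Doe, Jane",
    "deo, jane": "Doe, Jane",  # misspelling
}

# Matches a trailing or embedded 'et al' with optional punctuation.
ET_AL = re.compile(r",?\s*\bet\.?\s+al\b\.?", flags=re.IGNORECASE)


def standardize(name):
    """Drop 'et al' and replace known variants with the standardized name."""
    cleaned = ET_AL.sub("", name).strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)
```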

Up to this point, no collector names were added to any record that previously lacked them; the original informational content provided by the donor was maintained. This step only cleaned the collector names already associated with each record, to facilitate further editing in subsequent steps.

Step 2: Editing Collector Names

The database was sorted by the standardized locality name and collection date to approximate collecting events (true collecting events would require grouping by collector as well), and each approximated collecting event was then assigned a group number. Within each group, collectors were reviewed, and where at least one collector was found in all records within the group, all collector names were applied to all records. Thus, for a given group, we assumed that if one record included collector A only, while another record from the group had collectors A, B and C, then A, B and C all participated in the collecting event, and the first record should therefore also have collectors B and C. In many cases records with blank or 'unknown' collectors were assigned the collectors of the other records in the group. Efforts were also made to consider the original locality description: where the georeferenced locality name was more general (e.g. Lake Travis) and the original locality name was more specific (e.g. Lake Travis at site A), collectors were only synonymized to the level of the original donor's verbatim location. During this process additional errors in collector name spellings, missed during step 1, were identified and corrected.
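
The group rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, each record is reduced to a set of collector names, and blank records are assumed to be ignored when testing for a shared collector and then filled from the rest of the group. The actual review was manual and also weighed the verbatim locality.

```python
def reconcile_group(records):
    """Apply all collectors to every record in a group that shares a collector.

    records: list of sets of collector names for one approximated
    collecting event; an empty set stands for blank or 'unknown'.
    """
    known = [names for names in records if names]
    if not known:
        return records  # nothing to propagate
    shared = set.intersection(*known)
    if not shared:
        return records  # no collector common to all records: leave unchanged
    everyone = set.union(*known)
    return [set(everyone) for _ in records]
```

For example, a group holding `{"A"}`, `{"A", "B", "C"}` and a blank record would end with all three records listing A, B and C, while a group holding `{"A"}` and `{"B"}` would be left as it was.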

We identified 1,897 groups for review and we were able to reconcile 789 (42%) of those.

Step 3: Editing Non-null Dates

Dates were particularly troublesome, coming from donors in many formats, and sometimes with formats varying from record to record within a single donor data set. The order of month, day and year often differed among collections (and sometimes among records), and months were variously given as Roman numerals, numbers or text. Single dates were sometimes contained in a single field and at other times were parsed into several fields (separate year, month, day). Furthermore, some institutions provided date ranges while others had only begin or end dates.
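
To illustrate the kind of normalization these formats call for, the following Python sketch converts a month given as a Roman numeral, a number or text into a month number. The function and table names are hypothetical, and only English month names are assumed.

```python
ROMAN_MONTHS = {
    "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6,
    "vii": 7, "viii": 8, "ix": 9, "x": 10, "xi": 11, "xii": 12,
}
MONTH_ABBREVIATIONS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def parse_month(token):
    """Return the month number (1-12) for a verbatim token, or None."""
    text = token.strip().rstrip(".").lower()
    if text.isdigit():
        month = int(text)
        return month if 1 <= month <= 12 else None
    if text in ROMAN_MONTHS:
        return ROMAN_MONTHS[text]
    # Fall back to the first three letters of an English month name.
    return MONTH_ABBREVIATIONS.get(text[:3])
```

Tokens that cannot be interpreted return `None`, so they can be set aside for manual verification against the donor data rather than guessed.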

Using Microsoft Excel, all raw donor dates were reformatted as text in separate year, month, and day fields, manually verified against original donor data, and then preserved as 'verbatim dates' in the database. The database was sorted on our recently edited collector fields and our standardized locality name to approximate collecting events. Each approximated collecting event (true collecting events would require grouping by date as well) was then assigned a group number. Within each group, dates more than two years apart were reviewed. Many of these groups could easily be determined to be discrete collection events occurring on different dates, and thus apparently correct in the database, but in many cases we were able to confidently edit dates. In some rare cases this was a rather subjective act, but in general a conservative approach was taken and we tried to avoid making changes that could be controversial. Problems were often (but not always) identified and corrected under the following conditions: (1) the day and month were the same among all records in a group, but the year differed for one record; (2) the month or day was null but present in other records within the same group; or (3) the date was not possible, such as a collection from the future or outside of a collector's lifetime. Not counting completely null dates (whose editing is described in step 4), 118 track 1 records were confidently determined to have erroneous dates that we felt it appropriate to edit in some way.
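
The screening step, finding groups whose dates lie more than two years apart, can be approximated as below. This is a simplified sketch with a hypothetical function name: it compares years only, whereas the actual review considered full dates and was followed by manual judgment.

```python
from collections import defaultdict


def groups_to_review(records, max_span_years=2):
    """Return group numbers whose collection years span more than the limit.

    records: iterable of (group_number, year) pairs; year may be None
    for null dates, which are skipped here.
    """
    years_by_group = defaultdict(list)
    for group_number, year in records:
        if year is not None:
            years_by_group[group_number].append(year)
    return sorted(
        group
        for group, years in years_by_group.items()
        if max(years) - min(years) > max_span_years
    )
```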

Step 4: Editing Null Dates

To ensure that all records could be retrieved by date-based queries, for all records lacking (= having 'null') dates (3.7% of the track 1 database; 2,996 records), we either determined actual collection dates or derived conservative estimates of date ranges that we were confident bracketed the actual collection dates. To do this the database was sorted on collector and georeferenced location to approximate collecting events, and each approximated collecting event was then assigned a group number. Within each group, null dates (dates with day, month and year empty) were reviewed. All null dates were then converted to ranges determined using one or a combination of 4 methods: 1) For collectors we recognized as being older (active before approximately 1930) we reviewed historic documentation about the collector found online, but this was done quickly and conservative estimates of date ranges were always applied. We hope users will keep this in mind as they use the data and report any possible refinements to us. 2) For collections made by collectors well-represented in the database we used the earliest and latest years of their collections in our database. 3) For other records we set the date extent based on our personal knowledge of that person's collecting or determining activity and other evidence from the database. In many cases we were able to use the determination year to define the upper bound of the date range, but these records usually have large date ranges. 4) In cases where none of the above was possible we defined the date range as 1830 (approximately 2 decades prior to the first date recorded in the database) to the last date in the data track.
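
Two of the four methods lend themselves to a simple illustration: method 2 (a collector's own date extents) and method 4 (the 1830 fallback). The Python sketch below is hypothetical in its function names and data layout and works at the resolution of whole years; methods 1 and 3 relied on outside documentation and personal knowledge and are not represented.

```python
from datetime import date

FALLBACK_BEGIN = date(1830, 1, 1)  # 1830 is the fallback start year given in the text


def collector_extents(records):
    """Map each collector to the (earliest, latest) year of their dated records.

    records: iterable of (collector, year) pairs; year is None for null dates.
    """
    extents = {}
    for collector, year in records:
        if year is None:
            continue
        first, last = extents.get(collector, (year, year))
        extents[collector] = (min(first, year), max(last, year))
    return extents


def bracket_null_date(collector, extents, last_year):
    """Return a conservative (begin, end) date range for a null date."""
    if collector in extents:
        first, last = extents[collector]
        return date(first, 1, 1), date(last, 12, 31)
    # No dated records for this collector: fall back to the widest range.
    return FALLBACK_BEGIN, date(last_year, 12, 31)
```

Here `last_year` stands for the year of the last date in the data track and must be supplied by the caller.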

In some instances the collectors' verbatim field numbers, which typically include the date or at least the year, could be used to extract accurate date information.
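
Field number formats vary by collector, so any automated extraction is only a first pass. As a hedged sketch, assuming a field number that contains a standalone four-digit year between 1800 and 2099, the year could be pulled out as follows. The pattern and function name are assumptions for illustration.

```python
import re

# A four-digit year from 1800 to 2099 that is not part of a longer digit run.
YEAR_PATTERN = re.compile(r"(?<!\d)(18\d{2}|19\d{2}|20\d{2})(?!\d)")


def year_from_field_number(field_number):
    """Return the first standalone year found in a field number, or None."""
    match = YEAR_PATTERN.search(field_number)
    return int(match.group(1)) if match else None
```

Field numbers in which the date is run together with other digits are deliberately not matched and would need to be read manually.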

Step 5: Editing Determiner

The verbatim determiner names, as received from our donors, were affected by the same issues that affected collectors' names, so they were similarly processed by parsing into separate columns and assigning a number corresponding to their position in the text string. Individual determiner names were further parsed into the following four fields: determiner's position in verbatim text string, last name, first and middle name, and prefix. In the verbatim determiner field, most records contain only a single determiner or no determiner. In the rare cases where two or more determiners are listed, the last-positioned determiner was taken to be the most recent determiner, unless the multiple determiners were separated by 'and', in which case the first was assigned as the most recent determiner.
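
The rule for choosing the most recent determiner can be expressed as a short Python sketch. The function name and the choice of `;` and `/` as list separators are assumptions made for this example, and strings that mix 'and' with other separators are not handled.

```python
import re

AND = re.compile(r"\s*\band\b\s*", flags=re.IGNORECASE)


def most_recent_determiner(verbatim):
    """Pick the most recent determiner from a verbatim determiner string."""
    text = verbatim.strip()
    if not text:
        return None
    if AND.search(text):
        # Names joined by 'and': the first name is taken as most recent.
        return AND.split(text)[0].strip(" ,;")
    # Otherwise the last-positioned name is taken as most recent.
    parts = [part.strip() for part in re.split(r"[;/]", text) if part.strip()]
    return parts[-1] if parts else None
```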

To facilitate correction of spelling and completion of incomplete determiner names, all determiner names were arranged in a single column along with all collector names (formatted identically and with previous edits for spelling and completeness) and the entire list was sorted on last name and first name to juxtapose similar names representing the same person. Determiner names (and in some instances collector names) were then examined in light of other similar determiner and collector entries and edited accordingly but conservatively for spelling and completeness (informational content was not edited).
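
Sorting places similar names next to each other so that a reviewer can compare them by eye. One way to surface candidate pairs automatically, offered only as a sketch and not as the method used here, is to score adjacent names in the sorted list with Python's standard `difflib` module. The function name and the 0.85 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similar_neighbours(names, threshold=0.85):
    """Return adjacent pairs in the sorted name list that look alike."""
    ordered = sorted(set(names), key=str.lower)
    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        score = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if score >= threshold:
            pairs.append((first, second))
    return pairs
```

Flagged pairs would still need the same conservative manual check described above before any name is edited.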

Step 6: Date Outlier Detection & Editing

Efforts to correct collectors' names provided a method for examining collection dates. A pivot table was created in Excel that counted the number of records for each collector in each collection year, thus allowing quick detection of date outliers for any collector. However, due to the large size of this table, we facilitated finding outliers by extracting and focusing on only those collectors with gaps of more than 10 years in the distribution of their collecting activity, a condition that we suspected was likely to indicate errors in either collector name or date of collection. Using this method we identified 145 collector names in need of examination, and of those we were able to correct 18 by editing either the date of collection or the collector name. We suspect numerous others of being legitimate errors, but could not find sufficient justification for changing the data from the verbatim original, nor did we have resources to explore this method at a finer temporal scale.
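
The gap screen applied to the pivot table can be sketched in Python as follows. This is an illustration under the assumption that each record reduces to a (collector, year) pair; the function name is hypothetical, and a flagged gap only marks a collector for examination rather than proving an error.

```python
from collections import defaultdict


def collectors_with_gaps(records, min_gap=10):
    """Find collectors whose collecting years contain a gap above min_gap.

    records: iterable of (collector, year) pairs with integer years.
    Returns a dict mapping each flagged collector to a list of
    (year_before_gap, year_after_gap) tuples.
    """
    years_by_collector = defaultdict(set)
    for collector, year in records:
        years_by_collector[collector].add(year)

    flagged = {}
    for collector, years in years_by_collector.items():
        ordered = sorted(years)
        gaps = [
            (earlier, later)
            for earlier, later in zip(ordered, ordered[1:])
            if later - earlier > min_gap
        ]
        if gaps:
            flagged[collector] = gaps
    return flagged
```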