Examples

This section walks through complete examples of processing different types of file sources with our schema-driven approach. Each example shows how to configure a schema, run the search, and read the results for a specific file format.

For each file source, we will cover the following key steps:

Running the Schema Generation Command: This step involves executing a command to generate a schema based on the specified file format. The schema will define how the data should be processed and evaluated.
Creating the Schema: We provide the schema configuration file, detailing the necessary settings and parameters for processing the data.
Description of the Data We Want to Search: This section includes a sample dataset in the specified file format that we aim to search and analyze.
File to Run the Schema Against: Here, we present another sample dataset against which the schema will be applied to find matches.
Expected Results: Finally, we show the expected output after running the schema against the sample data, demonstrating the effectiveness of the schema-driven approach.

After working through these examples, you should be able to configure a schema for a given file format, run it against your own datasets, and interpret the resulting matches.

Working with JSON data

To analyze and process JSON data effectively, we use a schema-driven approach. This process involves running a command to generate a schema based on JSON-like data and evaluating the similarity of data entries against the schema.

Running the Schema Generation Command

First, we execute a command to generate a schema from the JSON-like data. This schema will be used to evaluate and process the data further.

findanywhere_schema jsonlike string_based_evaluation \
--threshold constant \
--out schema.yml

This command uses the findanywhere_schema tool to generate a schema file named schema.yml. The schema targets the jsonlike source and uses the string_based_evaluation method. The --threshold constant option selects a constant threshold for the evaluation process.

Creating the Schema

The schema configuration file defines how the data will be processed and evaluated. Below is an example of such a schema:

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: jaro_winkler
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
    file_format: !!python/object/apply:findanywhere.adapters.source.jsonlike.JSONLikeFormat
    - json
    tokenizer: delimiter
    tokenizer_config: {}
  name: jsonlike
threshold:
  config:
    constant: 0.8
  name: constant

This YAML configuration file includes several sections:

deduction: Defines the deduction method to use, which in this case is average.
evaluation: Specifies the evaluation method (string_based_evaluation) and its configuration, including the similarity measure (jaro_winkler) and aggregation method (max).
source: Describes the source data format and how it should be read, including encoding, error handling, and tokenizer configuration.
threshold: Sets the constant threshold value for evaluation.
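
To make the similarity setting more concrete, here is a minimal sketch of the textbook Jaro-Winkler measure in plain Python. It is not the tool's own code: the function names, the prefix weight of 0.1, and the prefix cap of 4 are conventional choices assumed here, and reading aggregate: max as "keep the highest-scoring candidate" is likewise an assumption. The example pairs are taken from the XML and HTML examples later in this section.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (case-sensitive)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def best_match(needle: str, candidates: list[str]) -> tuple[str, float]:
    """Assumed reading of 'aggregate: max': keep the best candidate."""
    return max(((c, jaro_winkler(needle, c)) for c in candidates),
               key=lambda pair: pair[1])


print(jaro_winkler("Fritzgerald", "Fitzgerald"))
print(best_match("Alice", ["Alice", "Ashcroft"]))
```

Under these assumptions, the sketch gives about 0.9727 for "Fritzgerald" against "Fitzgerald", which agrees with the similarity reported in the XML example.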

Description of the Data We Want to Search

Here is the JSON data we aim to search. This dataset contains user information such as IDs, emails, first names, and last names.

[
    {
        "id": "alice.ashcroft", "email": "alice.ashcroft@here.local",
        "first_name": "Alice", "last_name": "Ashcroft"
    },
    {
        "id": "bob.bones", "email":  "charlie.st.claire@here.local",
        "first_name": "Bob", "last_name": "Bones"
    }
]

This dataset includes two user records:

Alice Ashcroft with ID alice.ashcroft and email alice.ashcroft@here.local
Bob Bones with ID bob.bones and email charlie.st.claire@here.local

File to Run the Schema Against

We also have a JSON file containing data entries we want to match against our schema.

[
    {"firstname": "Alice", "lastname": "Ashcroft", "order_no": 5},
    {"name": "Bones", "other": "Bob", "order_no": 6}
]

This dataset includes two records:

An entry with firstname “Alice”, lastname “Ashcroft”, and order_no 5.
An entry with name “Bones”, other “Bob”, and order_no 6.

Expected Results

After running the schema against the data file, we expect to obtain a result indicating the best matches based on the similarity evaluation. Below is an example of the expected output:

[
  {
    "of": "Alice",
    "best_matches": {
      "first_name": {
        "position": {"path": [0, "firstname"], "token": 0},
        "value": "Alice",
        "similarity": 1.0
      },
      "id": {
        "position": {"path": [0, "firstname"], "token": 0},
        "value": "Alice",
        "similarity": 1.0
      },
      "last_name": {
        "position": {"path": [0, "lastname"], "token": 0},
        "value": "Ashcroft",
        "similarity": 1.0
      }
    },
    "score": 1.0
  },
  {
    "of": "Bob",
    "best_matches": {
      "first_name": {
        "position": {"path": [1, "other"], "token": 0},
        "value": "Bob",
        "similarity": 1.0
      },
      "id": {
        "position": {"path": [1, "other"], "token": 0},
        "value": "Bob",
        "similarity": 1.0
      },
      "last_name": {
        "position": {"path": [1, "name"], "token": 0},
        "value": "Bones",
        "similarity": 1.0
      }
    },
    "score": 1.0
  }
]

The result includes:

A match for “Alice” with perfect similarity scores for first_name, id, and last_name.
A match for “Bob” with perfect similarity scores for first_name, id, and last_name.

This output demonstrates that the schema and evaluation process correctly identified and matched the relevant data fields from the input dataset.
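
The path entries in the position objects above, such as [0, "firstname"], read as the list index followed by the key of the matched value. The following sketch shows one way such paths could be derived by walking a JSON-like structure. The function name iter_leaves is hypothetical, and this is an illustration of the path notation, not the tool's own traversal.

```python
def iter_leaves(node, path=()):
    """Yield (path, value) for every scalar in a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, path + (index,))
    else:
        yield list(path), node


# The file we ran the schema against in this example.
records = [
    {"firstname": "Alice", "lastname": "Ashcroft", "order_no": 5},
    {"name": "Bones", "other": "Bob", "order_no": 6},
]

leaves = list(iter_leaves(records))
for leaf_path, value in leaves:
    print(leaf_path, value)
```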

Working with malformed CSV data

A “malformed CSV” refers to a CSV (Comma-Separated Values) file that deviates from the standard format where each line contains the same number of fields, separated consistently by a single specified delimiter, typically a comma. In the context of this document, a malformed CSV may exhibit one or more of the following anomalies:

Inconsistent Delimiters: Instead of using a consistent delimiter like a comma, the file uses varied delimiters (e.g., ‘##’, commas, spaces), or mixes multiple types of delimiters unpredictably across the data.

Irregular Number of Fields: Some rows contain different numbers of fields compared to others. This could include extra fields or missing fields, which may lead to misalignment when attempting to parse the data into a structured form.

Non-standard Encoding or Escaping: Issues in text encoding or improper escaping of special characters (like newline characters within a field) can cause further complications in parsing the file.

Such inconsistencies can lead to challenges in data processing and analysis, as standard CSV parsing tools or libraries may fail to correctly interpret or extract the data fields. Special handling and customized parsing strategies are often required to work effectively with malformed CSV files.

Assuming the following malformed CSV file

This section provides an example of a CSV file that does not adhere to a standard single-delimiter format, which can pose challenges during data processing:

name##age
alice##32
bob##24##blue
charlie##30##cyan##cheeseburger

Create initial schema

To begin processing the malformed CSV file, we first generate an initial schema using the findanywhere_schema tool. Because the file cannot be parsed reliably as CSV, we select the textfile source together with string-based evaluation:

findanywhere_schema textfile string_based_evaluation \
--threshold constant \
--out schema.yml

Modify schema to reflect source file, treated as text with delimiter

Once the initial schema is created, we modify it to reflect the structure of our source file. The adjusted schema below treats the file as plain text, splits each line on the ‘##’ delimiter, and keeps the constant threshold of 0.8:

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
    tokenizer_config:
      delimiters:
        - "##"
  name: textfile
threshold:
  config:
    constant: 0.8
  name: constant
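
With ‘##’ configured as the delimiter, each line of the sample file is split into tokens, and a token can be addressed by its line and its position within that line. The sketch below illustrates this with plain string splitting. The function name tokenize and the zero-based numbering of lines and tokens are assumptions made for illustration; the tool's own numbering may differ.

```python
SAMPLE = (
    "name##age\n"
    "alice##32\n"
    "bob##24##blue\n"
    "charlie##30##cyan##cheeseburger\n"
)


def tokenize(text, delimiter="##"):
    """Yield (position, token) pairs; numbering is zero-based by assumption."""
    for line_no, line in enumerate(text.splitlines()):
        for token_no, token in enumerate(line.split(delimiter)):
            yield {"line": line_no, "token": token_no}, token


tokens = list(tokenize(SAMPLE))

# Count the fields per line to make the irregular row lengths visible.
fields_per_line = {}
for position, _ in tokens:
    fields_per_line[position["line"]] = position["token"] + 1

print(fields_per_line)
```

Because every token is evaluated on its own, rows with extra or missing fields do not need to line up with a header.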

Data we want to search

The schema is then used to search the file for the entries listed in the following JSON file. Each entry has an identifier and an email address, and the first entry also has a town:

[
  {"id": "alice", "email": "alice.ashcroft@here.local", "town":  "Ashville"},
  {"id": "charlie", "email":  "charlie.st.claire@here.local"}
]

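The search is run with the findanywhere command, passing the schema file, the search data, and the file to search in:

findanywhere schema.yml search_for.json search_in.csv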

Yields results

The results of the search are shown below. For each search entry (of), the output lists the best-matching token per field with its position, value, and similarity, followed by the overall score:

{
    "of": "alice",
    "best_matches": {
        "email": {
            "position": {"line": 1, "token": 2},
            "value": "alice.ashcroft@here.local",
            "similarity": 1.0
        },
        "id": {
            "position": {"line": 1, "token": 0},
            "value": "alice.ashcroft",
            "similarity": 0.8714285714285714
        },
        "town": {
            "position": {"line": 1, "token": 1},
            "value": "5th Avenue Ashville",
            "similarity": 1.0
        }
    },
    "score": 0.9571428571428572
}
{
    "of": "alice",
    "best_matches": {
        "email": {
            "position": {"line": 2, "token": 3},
            "value": " California,bob.bones@here.local\n",
            "similarity": 0.7280286738351255
        },
        "id": {
            "position": {"line": 2, "token": 1},
            "value": "Alice Ashcroft Memorial Lane",
            "similarity": 0.8666666666666667
        },
        "town": {
            "position": {"line": 2, "token": 2},
            "value": "Ashville Cyan County",
            "similarity": 1.0
        }
    },
    "score": 0.8648984468339308
}
{
    "of": "charlie",
    "best_matches": {
        "email": {
            "position": {"line": 3, "token": 1},
            "value": "charlie.st.claire@here.local",
            "similarity": 1.0
        },
        "id": {
            "position": {"line": 3, "token": 0},
            "value": "charlie.st.claire",
            "similarity": 0.8823529411764706
        }
    },
    "score": 0.9411764705882353
}
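
The score values above are consistent with the average deduction named in the schema: each score equals the arithmetic mean of the per-field similarities in the same result. The sketch below reproduces the three scores. The helper names are hypothetical, and comparing the overall score against the constant threshold with >= is an assumption about how the threshold is applied.

```python
from statistics import fmean


def deduce_average(similarities):
    """Average the per-field similarities into one score."""
    return fmean(similarities.values())


def passes(score, threshold=0.8):
    """Assumed threshold rule: keep results whose score reaches the constant."""
    return score >= threshold


# Per-field similarities copied from the three results above.
results = [
    {"email": 1.0, "id": 0.8714285714285714, "town": 1.0},
    {"email": 0.7280286738351255, "id": 0.8666666666666667, "town": 1.0},
    {"email": 1.0, "id": 0.8823529411764706},
]

scores = [deduce_average(r) for r in results]
for score in scores:
    print(round(score, 4), passes(score))
```

Note that the second result contains a field similarity below 0.8 while its averaged score lies above it.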

Working with XML data

In this section, we will cover the process of searching an XML file using a schema and evaluation method. The goal is to identify specific data within the XML file by defining a schema and applying a string-based evaluation method.

To start, we need an XML file that contains the data we want to search. We will also define the schema and the evaluation method to process the XML file.

We use the findanywhere_schema command to create a schema for the xmlfile source with the string-based evaluation method. The --threshold parameter selects how the matching threshold is determined, and the --deduce_score parameter specifies how the overall score is calculated.

findanywhere_schema xmlfile string_based_evaluation --threshold constant --out schema.yml --deduce_score average

After running the command, the schema generation results will look like this. The schema defines how the XML data should be processed and evaluated.

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: jaro_winkler
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    include_attributes: true
    tokenizer: delimiter
    tokenizer_config: {}
  name: xmlfile
threshold:
  config:
    constant: 0.8
  name: constant

Here is the XML file that contains the library data. This file includes metadata about the library and details about the books available in the library.

<?xml version="1.0" encoding="UTF-8"?>
<Library>
    <Metadata>
        <LibraryName>Central City Library</LibraryName>
        <Location>123 Library St, Central City</Location>
        <Contact>Email: info@centrallibrary.com, Phone: 123-456-7890</Contact>
        <EstablishedYear>1902</EstablishedYear>
    </Metadata>
    <Books>
        <Book lent_to="Alice Ashcroft">
            <Title>The Great Gatsby</Title>
            <Author>F. Scott Fitzgerald</Author>
            <PublicationYear>1925</PublicationYear>
            <ISBN>9780743273565</ISBN>
            <Genre>Fiction</Genre>
        </Book>
        <Book>
            <Title>To Kill a Mockingbird</Title>
            <Author>Harper Lee</Author>
            <PublicationYear>1960</PublicationYear>
            <ISBN>9780061120084</ISBN>
            <Genre>Fiction</Genre>
        </Book>
        <Book>
            <Title>1984</Title>
            <Author>George Orwell</Author>
            <PublicationYear>1949</PublicationYear>
            <ISBN>9780451524935</ISBN>
            <Genre>Dystopian</Genre>
        </Book>
    </Books>
</Library>

We will use the following JSON data to search within the XML file. This data contains the names we want to find in the XML file.

[
  {
    "first_name": "Francis",
    "middle_name": "Scott",
    "last_name": "Fritzgerald"
  },
  {
    "first_name": "Alice",
    "last_name": "Ashcroft"
  }
]

To perform the search, we use the findanywhere command, specifying the schema file, the JSON data file, and the XML file.

findanywhere schema.yml search_data.json books.xml

The command returns the results of the search. The output indicates the best matches found in the XML file for each entry in the JSON data, along with their positions and similarity scores.

{
    "of": "0",
    "best_matches": {
        "first_name": {
            "position": {
                "tag": "title",
                "path": ["Library", "Books", "Book"],
                "index": 1,
                "token": 2
            },
            "value": "a",
            "similarity": 0.7142857142857143
        },
        "last_name": {
            "position": {
                "tag": "author",
                "path": ["Library", "Books", "Book"],
                "index": 0,
                "token": 2
            },
            "value": "Fitzgerald",
            "similarity": 0.9727272727272728
        },
        "middle_name": {
            "position": {
                "tag": "author",
                "path": ["Library", "Books", "Book"],
                "index": 0,
                "token": 1
            },
            "value": "Scott",
            "similarity": 1.0
        }
    },
    "score": 0.8956709956709957
}
{
    "of": "1",
    "best_matches": {
        "first_name": {
            "position": {
                "tag": "lent_to@book",
                "path": ["Library", "Books", "Book"],
                "index": 0,
                "token": 0
            },
            "value": "Alice",
            "similarity": 1.0
        },
        "last_name": {
            "position": {
                "tag": "lent_to@book",
                "path": ["Library", "Books", "Book"],
                "index": 0,
                "token": 1
            },
            "value": "Ashcroft",
            "similarity": 1.0
        }
    },
    "score": 1.0
}
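
The tag value "lent_to@book" in the second result refers to the lent_to attribute of a Book element, which is searchable because include_attributes is enabled in the schema. The following sketch uses Python's standard xml.etree.ElementTree module to collect element texts and attribute values in a similar shape. The function name iter_values is hypothetical, and the lowercased, attribute@tag labels are inferred from the output above rather than taken from the tool's code.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Library><Books>
  <Book lent_to="Alice Ashcroft">
    <Title>The Great Gatsby</Title>
    <Author>F. Scott Fitzgerald</Author>
  </Book>
</Books></Library>"""


def iter_values(root, include_attributes=True):
    """Yield (label, text) for attribute values and non-empty element texts."""
    for element in root.iter():
        tag = element.tag.lower()
        if include_attributes:
            for name, value in element.attrib.items():
                # Label format inferred from the results shown above.
                yield f"{name}@{tag}", value
        text = (element.text or "").strip()
        if text:
            yield tag, text


values = list(iter_values(ET.fromstring(SAMPLE)))
for label, text in values:
    # Splitting on whitespace yields tokens such as "F.", "Scott", "Fitzgerald".
    print(label, text.split())
```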

Working with HTML data

findanywhere_schema website string_based_evaluation --threshold constant --out schema.yml --deduce_score average

The command above generates a schema file (schema.yml) for the website source, using the string_based_evaluation method with a constant threshold. The resulting schema deduces the overall score as the average of the individual evaluations.

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: jaro_winkler
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    html_parser_name: html.parser
    tokenizer: delimiter
    tokenizer_config: {}
  name: website
threshold:
  config:
    constant: 0.8
  name: constant

The schema file (schema.yml) includes:

deduction: Configuration for averaging scores (average).
evaluation: Utilizes string_based_evaluation with parameters such as maximum aggregation and Jaro-Winkler similarity.
source: Specifies website as the source, using HTML parser html.parser with delimiter tokenization.
threshold: Sets a constant threshold of 0.8 for scoring.
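
To illustrate what searching an HTML source involves, the sketch below uses the html.parser module from Python's standard library to collect each piece of text together with its enclosing tag and the path of ancestor tags. The class name TextCollector is hypothetical, and this is not how the website source is implemented; it only shows the kind of (tag, path, text) information that a position in the results refers to.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect (tag, ancestor path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to the matching tag; this also discards unclosed
        # void elements such as <br> or <meta> left on the stack.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append((self.stack[-1], list(self.stack[:-1]), text))


PAGE = ("<html><body><div><h1>Example Domain</h1>"
        "<p>More information...</p></div></body></html>")

collector = TextCollector()
collector.feed(PAGE)
for tag, path, text in collector.found:
    print(tag, path, text.split())
```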

[
  {
    "keyword": "Example"
  },
  {
    "reference": "Information"
  }
]

The JSON above specifies our search criteria. The target can be a website such as https://example.com or a local file. We are looking for the keyword “Example” and the reference “Information”.

findanywhere schema.yml search_data.json http://example.com --sequential 1

The findanywhere command above applies the schema (schema.yml) and the search data (search_data.json) to the website. Since multiprocessing isn’t required for a single website, we opt for sequential mode (--sequential 1).

{
    "of": "0",
    "best_matches": {
        "keyword": {
            "position": {
                "tag": "h1",
                "path": ["html", "body", "div"],
                "index": 0,
                "token": 0
            },
            "value": "Example",
            "similarity": 1.0
        }
    },
    "score": 1.0
}
{
    "of": "1",
    "best_matches": {
        "reference": {
            "position": {
                "tag": "p",
                "path": ["html", "body", "div"],
                "index": 25,
                "token": 1
            },
            "value": "information...",
            "similarity": 0.8744588744588745
        }
    },
    "score": 0.8744588744588745
}
{
    "of": "1",
    "best_matches": {
        "reference": {
            "position": {
                "tag": "a",
                "path": ["html", "body", "div", "p"],
                "index": 1,
                "token": 1
            },
            "value": "information...",
            "similarity": 0.8744588744588745
        }
    },
    "score": 0.8744588744588745
}
{
    "of": "0",
    "best_matches": {
        "keyword": {
            "position": {
                "tag": "title",
                "path": ["html", "head"],
                "index": 0,
                "token": 0
            },
            "value": "Example",
            "similarity": 1.0
        }
    },
    "score": 1.0
}

For each match, the results report the matched element (tag, path, index, token), the matched value (value), the similarity score (similarity), and the overall score (score). Matches are returned for both the keyword and the reference specified in the search data.
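
The results in these examples are printed as a sequence of JSON objects rather than as one JSON array, so json.loads on the whole output would fail. Assuming the output has been captured as a string, the sketch below reads the objects one by one with json.JSONDecoder.raw_decode from the standard library. The function name iter_json_objects and the shortened sample output are illustrative only.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object from a concatenated stream."""
    decoder = json.JSONDecoder()
    index = 0
    length = len(text)
    while index < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while index < length and text[index].isspace():
            index += 1
        if index >= length:
            break
        obj, index = decoder.raw_decode(text, index)
        yield obj


# Shortened stand-in for captured output (best_matches omitted).
captured = """
{"of": "0", "score": 1.0}
{"of": "1", "score": 0.8744588744588745}
{"of": "1", "score": 0.8744588744588745}
"""

parsed = list(iter_json_objects(captured))
perfect = [result["of"] for result in parsed if result["score"] == 1.0]
print(len(parsed), perfect)
```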